Abstract

We address the issue of incorporating a particular yet expressive form of integrity constraints (namely, denial constraints) into 
probabilistic databases. To this aim, we move away from the common way of giving semantics to probabilistic databases, which 
relies on considering a unique interpretation of the data, and address two fundamental problems: consistency checking and query 
evaluation. The former consists in verifying whether there is an interpretation which conforms to both the marginal probabilities 
of the tuples and the integrity constraints. The latter is the problem of answering queries under a "cautious" paradigm, taking 
into account all interpretations of the data in accordance with the constraints. In this setting, we investigate the complexity of the 
above-mentioned problems, and identify several tractable cases of practical relevance. 

Keywords: Probabilistic databases, Integrity constraints, Consistency checking

1. Introduction

Probabilistic databases (PDBs) are widely used to represent uncertain information in several contexts, including
data collected from sensor networks, data integration from heterogeneous sources, bio-medical data, and, more
generally, data resulting from statistical analyses. In this setting, several relevant results have been obtained
regarding the evaluation of conjunctive queries, thanks to the definition of probabilistic frameworks dealing with two
substantially different scenarios: the case of tuple-independent PDBs 111 lll24ll . where all the tuples of the database are 
considered independent of one another, and the case of PDBs representing probabilistic networks encoding even
complex forms of correlations among the data [5]. However, none of these frameworks takes into account integrity
constraints in the same way as in the deterministic setting, where constraints are used to enforce the consistency
of the data. In fact, the former framework strongly relies on the independence assumption (which clearly is
in contrast with the presence of the correlations entailed by integrity constraints). The latter framework is closer to an
AI perspective of representing the information, as it requires the correlations among the data to be represented as data
themselves. This is different from the DB perspective, where constraints are part of the schema, and not of the data.

In this paper, we address the issue of incorporating integrity constraints into probabilistic databases, with the aim
of extending the classical semantics and usage of integrity constraints of the deterministic setting to the probabilistic
one. Specifically, we consider one of the most popular logical models for probabilistic data, where information
is represented as tuples associated with probabilities, and we allow denial constraints to be imposed on the
data, i.e., constraints forbidding the co-existence of certain tuples. In our framework, the role of integrity constraints
is the same as in the deterministic setting: they can be used to decide whether a new tuple can be inserted in the
database, or to decide (a posteriori w.r.t. the generation of the data) if the data are consistent.

Before explaining in detail the main contribution of our work, we provide a motivating example, which clarifies the
impact of augmenting a PDB with (denial) constraints. In particular, we focus on the implications for the consistency
of the probabilistic data and for the evaluation of queries. We assume that the reader is acquainted with the data
representation model where uncertainty is represented by associating tuples with a probability, and with the notion of
possible world (however, these concepts will be formally recalled in the first sections of the paper).

Motivating Example 

Consider the PDB schema $\mathcal{D}^p$ consisting of the relation schema $Room^p$(Id, Hid, Price, Type, View, P), and its
instance $room^p$ in Figure 1.

Figure 1. Relation instance $room^p$

Every tuple in $room^p$ is characterized by the room identifier Id, the identifier Hid of the hotel owning the room,
its price per night, its type (e.g., "Standard", "Suite"), and the attribute View describing the room view. The attribute
P specifies the probability that the tuple is true. For now, we leave the probabilities of the three tuples as parameters
($p_1$, $p_2$, $p_3$), as we will consider different values to better explain the main issues related to consistency and
query evaluation.

Assume that the following constraint $ic$ is defined over $\mathcal{D}^p$: "in the same hotel, standard rooms cannot be more
expensive than suites". This is a denial constraint, as it forbids the coexistence of tuples not satisfying the specified
property. In particular, $ic$ entails that $t_1$ and $t_2$ are mutually exclusive, as, according to $t_1$, the standard room 1 would
be more expensive than the suite room 2 belonging to the same hotel as room 1. For the same reason, $ic$ forbids the
coexistence of $t_2$ and $t_3$.

Finally, consider the following query $q$ on $\mathcal{D}^p$: "Are there two standard rooms with sea view in hotel 1?". We now
show how the consistency of the database and the answer of $q$ vary when changing the probabilities of the tuples of $room^p$.

Case 1 (No admissible interpretation): $p_1 = 3/4$; $p_2 = 1/2$; $p_3 = 1/2$.

In this case, we can conclude that the database is inconsistent. In fact, $ic$ forbids the coexistence of $t_1$ and $t_2$, which
means that the possible worlds containing $t_1$ must be distinct from those containing $t_2$. But the marginal probabilities
of $t_1$ and $t_2$ do not allow this: the fact that $p_1 = 3/4$ and $p_2 = 1/2$ implies that the sum of the probabilities of the worlds
containing either $t_1$ or $t_2$ would be $3/4 + 1/2$, which is greater than 1.

Case 2 (Unique admissible interpretation): $p_1 = 1/2$; $p_2 = 1/2$; $p_3 = 1/2$.

In this case, the database is consistent, as it represents two possible worlds: $w_1 = \{t_1, t_3\}$ and $w_2 = \{t_2\}$, both with
probability $1/2$ (correspondingly, the possible worlds representing the other subsets of $\{t_1, t_2, t_3\}$ have probability 0).
Observe that there is no other way to interpret the database while making the constraint satisfied in each possible
world and keeping the probabilities of the possible worlds compatible with the marginal probabilities of $t_1$, $t_2$, $t_3$. Thus, the
database is consistent and has a unique admissible interpretation.

Now, evaluating the above-defined query $q$ over all the admissible interpretations of the database yields the answer
true with probability $1/2$ (which is the probability of $w_1$, the only non-zero-probability world, in the unique admissible
interpretation, where $q$ evaluates to true). Note that, if $ic$ were disregarded and $q$ were evaluated using the independence
assumption, the answer of $q$ would be true with probability $p_1 \cdot p_3 = 1/4$.

Case 3 (Multiple admissible interpretations): $p_1 = 1/2$; $p_2 = 1/4$; $p_3 = 1/2$.

In this case, we can conclude that the database is consistent, as it admits at least the interpretations $I_1$ and $I_2$
represented in the two rows of the following table (each cell is the probability of the possible world reported in the column
header).

        ∅      {t1}    {t2}    {t3}    {t1,t3}
I1      0      1/4     1/4     1/4     1/4
I2      1/4    0       1/4     0       1/2

With a little effort, the reader can check that there are infinitely many ways of interpreting the database while
satisfying the constraints: each interpretation can be obtained by assigning to the possible world $\{t_1, t_3\}$ a different
probability in the range $[1/4, 1/2]$, and then suitably modifying the probabilities of the other possible worlds where $ic$ is
satisfied. Basically, the interpretations $I_1$ and $I_2$ correspond to the two extreme possible scenarios where, compatibly
with the integrity constraint $ic$, a strong negative or positive correlation exists between $t_1$ and $t_3$. The other interpretations
correspond to scenarios where an "intermediate" correlation exists between $t_1$ and $t_3$. Thus, differently from the
previous case, there is now more than one admissible interpretation for the database.

Observe that, in the absence of any additional information about the actual correlation among the tuples of $room^p$, all
of the above-described admissible interpretations are equally reasonable. Hence, when evaluating queries, we use a
"cautious" paradigm, where all the admissible interpretations are taken into account, meaning that no assumption on
the actual correlations among tuples is made, besides those which are derivable from the integrity constraints. Thus,
according to this paradigm, the answer of query $q$ is true with a probability range $[1/4, 1/2]$ (where the boundaries of this
range are the overall probabilities assigned to the possible worlds containing both $t_1$ and $t_3$ by $I_1$ and $I_2$). As pointed
out in the discussion of Case 2, if the independence assumption were adopted (and $ic$ disregarded), the answer of $q$
would be true with probability $1/4$, which is the left boundary of the probability range obtained as the cautious answer.
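
The three cases can be reproduced with a small brute-force computation. The sketch below is only illustrative: it assumes the marginals $p_1 = 1/2$, $p_2 = 1/4$, $p_3 = 1/2$ for Case 3, and it relies on the observation that $ic$ leaves only the worlds ∅, {t1}, {t2}, {t3} and {t1,t3} with possibly non-zero probability, so that fixing x = Pr({t1,t3}) determines all the other world probabilities. The names `model_for` and `cautious_range` are hypothetical helpers, not part of the paper.

```python
from fractions import Fraction as F

def model_for(x, p1, p2, p3):
    """World probabilities of the admissible interpretation with Pr({t1,t3}) = x.

    Worlds containing t2 together with t1 or t3 violate ic and get probability 0.
    Returns None when some world probability would be negative.
    """
    pr = {
        frozenset({"t1", "t3"}): x,
        frozenset({"t1"}): p1 - x,  # forced by the marginal of t1
        frozenset({"t3"}): p3 - x,  # forced by the marginal of t3
        frozenset({"t2"}): p2,      # t2 can only appear alone
    }
    pr[frozenset()] = 1 - sum(pr.values())  # remaining probability mass
    return pr if all(v >= 0 for v in pr.values()) else None

def cautious_range(p1, p2, p3, steps=64):
    """Scan x on a grid of step 1/steps; return (min, max) feasible x, or None."""
    feasible = [F(k, steps) for k in range(steps + 1)
                if model_for(F(k, steps), p1, p2, p3) is not None]
    return (min(feasible), max(feasible)) if feasible else None

low, high = cautious_range(F(1, 2), F(1, 4), F(1, 2))
print(low, high)  # 1/4 1/2
```

The grid scan finds the exact endpoints here only because they are multiples of 1/64; a linear program over the world probabilities would be the general tool. With all three marginals equal to 1/2 the same function returns a single feasible value, and with marginals whose sum for two conflicting tuples exceeds 1 it returns None.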

Main contribution 

We address the following two fundamental problems: 

1) Consistency checking: the problem of deciding the consistency of a PDB w.r.t. a given set of denial constraints, 
that is deciding if there is at least one admissible interpretation of the data. This problem naturally arises 
when integrity constraints are considered over PDBs: the information encoded in the data (which are typically 
uncertain) may be in contrast with the information encoded in the constraints (which are typically certain, as 
they express well-established knowledge about the data domain). Hence, detecting possible inconsistencies 
arising from the co-existence of certain and uncertain information is relevant in several contexts, such as query 
evaluation, data cleaning and repairing. 

In this regard, our contribution consists in a thorough characterization of the complexity of this problem. Specifically,
after noticing that, in the general case, this problem is NP-complete (owing to its interconnection to the
probabilistic version of SAT), we identify several islands of tractability, which hold when either:

i) the conflict hypergraph (i.e., the hypergraph whose edges are the sets of tuples which cannot coexist
according to the constraints) has some structural property (namely, it is a hypertree or a ring), or

ii) the constraints have some syntactic properties (independently from the shape of the conflict hypergraph).

2) Query evaluation: the problem of evaluating queries over a database which is consistent w.r.t. a given set of 
denial constraints. Query evaluation relies on the "cautious" paradigm described in Case 3 of the motivating 
example above, which takes into account all the possible ways of interpreting the data in accordance with the 
constraints. Specifically, query answers consist of pairs $(t, r^p)$, where $t$ is a tuple and $r^p$ a range of probabilities.
Therein, $r^p$ is the narrowest interval containing all the probabilities which would be obtained for $t$ as an answer
of the query when considering all the admissible interpretations of the data (and, thus, all the correlations among
the data compatible with the constraints).

For this problem, we address both its decisional and search versions, studying the sensitivity of their complexity
to the specific constraints imposed on the data and the characteristics of the query. We show that, in the case
of general conjunctive queries, the query evaluation problem is $FP^{NP[\log n]}$-hard and in $FP^{NP}$ (note that $FP^{NP}$
is contained in #P, the class for which the query evaluation problem under the independence assumption is
complete). Moreover, we identify tractable cases where the query evaluation problem is in PTIME, which
depend on the characteristics of the query and, analogously to the case of the consistency checking problem, on
either the syntactic form of the constraints or some structural properties of the conflict hypergraph.

Moreover, we consider the following extensions of the framework and discuss their impact on the above-summarized
results:

A) tuples are associated with probability ranges, rather than single probabilities: this is useful when the data 
acquisition process is not able to assign a precise probability value to the tuples 113 ll 13511 : 

B) also denial constraints are probabilistic: this allows also the domain knowledge encoded by the constraints to 
be taken into account as uncertain; 

C) pairs of tuples are considered independent unless this contradicts the constraints: this is a way of interpreting
the data in between adopting tuple-independence and rejecting it, and is well suited for those cases where one
finds it reasonable to assume some groups of tuples as independent from one another. For instance, if we
consider further tuples pertaining to a different hotel in the introductory example (where constraints involve
tuples over the same hotel), it may be reasonable to assume that these tuples encode events independent from
those pertaining to hotel 1.



2. Fundamental notions 

2.1. Deterministic Databases and Constraints 

We assume classical notions of database schema, relation schema, and relation instance. Relation schemas will be
represented by sorted predicates of the form $R(A_1, \dots, A_n)$, where $R$ is said to be the name of the relation schema and
$A_1, \dots, A_n$ are attribute names composing the set denoted as $Attr(R)$. A tuple over a relation schema $R(A_1, \dots, A_n)$ is
a member of $\Delta_1 \times \dots \times \Delta_n$, where each $\Delta_i$ is the domain of attribute $A_i$ (with $i \in [1..n]$). A relation instance of $R$ is a
set $r$ of tuples over $R$. A database schema $\mathcal{D}$ is a set of relation schemas, and a database instance $D$ of $\mathcal{D}$ is a set of
relation instances of the relation schemas of $\mathcal{D}$. Given a tuple $t$, the value of attribute $A$ of $t$ will be denoted as $t[A]$.

A denial constraint over a database schema $\mathcal{D}$ is of the form $\forall \vec{x}.\neg[R_1(\vec{x}_1) \wedge \dots \wedge R_m(\vec{x}_m) \wedge \phi(\vec{x})]$, where:

- $R_1, \dots, R_m$ are names of relations in $\mathcal{D}$;

- $\vec{x}$ is a tuple of variables and $\vec{x}_1, \dots, \vec{x}_m$ are tuples of variables and constants such that $\vec{x} = Var(\vec{x}_1) \cup \dots \cup Var(\vec{x}_m)$,
where $Var(\vec{x}_i)$ denotes the set of variables in $\vec{x}_i$;

- $\phi(\vec{x})$ is a conjunction of built-in predicates of the form $x \circ y$, where $x$ and $y$ are either variables in $\vec{x}$ or constants, and
$\circ$ is a comparison operator in $\{=, \neq, \leq, \geq, <, >\}$.

$m$ is said to be the arity of the constraint. Denial constraints of arity 2 are said to be binary. For the sake of brevity,
constraints will be written in the form $\neg[R_1(\vec{x}_1) \wedge \dots \wedge R_m(\vec{x}_m) \wedge \phi(\vec{x})]$, thus omitting the quantification $\forall \vec{x}$.

We say that a denial constraint $ic$ is join-free if no variable occurs in two distinct relation atoms of $ic$, and, for
each built-in predicate occurring in $\phi$, at least one term is a constant. Observe that join-free constraints allow multiple
occurrences of the same relation name.

It is worth noting that denial constraints enable equality generating dependencies (EGDs) to be expressed: an EGD
is a denial constraint where all the conjuncts of $\phi$ are not-equal predicates. Obviously, this means that denial
constraints also enable functional dependencies (FDs) to be expressed, as an FD is a binary EGD over a unique relation
(when referring to FDs, we consider also non-canonical ones, i.e., FDs whose RHSs contain multiple attributes).

Given an instance $D$ of the database schema $\mathcal{D}$ and an integrity constraint $ic$ over $\mathcal{D}$, the fact that $D$ satisfies (resp.,
does not satisfy) $ic$ is denoted as $D \models ic$ (resp., $D \not\models ic$) and is defined in the standard way. $D$ is said to be consistent
w.r.t. a set of integrity constraints $IC$, denoted with $D \models IC$, iff $\forall ic \in IC$, $D \models ic$.

Example 1. Let $\mathcal{D}$ be the (deterministic) database schema consisting of the relation schema Room(Id, Hid, Price,
Type, View), obtained by removing the probability attribute from the relation schema of our motivating example.
Assume the following denial constraints over $\mathcal{D}$:

$ic$: $\neg[Room(x_1, x_2, x_3, \text{'Std'}, x_4) \wedge Room(x_5, x_2, x_6, \text{'Suite'}, x_7) \wedge x_3 > x_6]$, saying that, in the same hotel, there cannot
be standard rooms more expensive than suites;

$ic'$: $\neg[Room(x_1, x_2, x_3, x_4, x_5) \wedge Room(x_6, x_2, x_7, x_4, x_8) \wedge x_3 \neq x_7]$, imposing that rooms of the same type and hotel
have the same price. Thus, $ic'$ is the FD: Hid, Type $\rightarrow$ Price.

Here, $ic$ is the constraint presented in the introductory example. Consider the relation instance $room$ of Room,
obtained from the instance $room^p$ of the motivating example by removing column P. It is easy to see that $room$
satisfies $ic'$, but does not satisfy $ic$, since, for the same hotel, the price of standard rooms (rooms 1 and 3) is greater
than that of suite room 2. □
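
The satisfaction check of Example 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the concrete prices (120, 70, 120) and the View values are placeholders chosen only to match the description above (rooms 1 and 3 are standard rooms of hotel 1 with the same price, higher than that of suite room 2), and the helper names are hypothetical.

```python
from itertools import product

# Hypothetical instance of Room(Id, Hid, Price, Type, View); values are placeholders.
room = [
    (1, 1, 120, "Std", "Sea"),
    (2, 1, 70, "Suite", "Courtyard"),
    (3, 1, 120, "Std", "Sea"),
]

def violates_ic(a, b):
    """Body of ic: a is a standard room, b a suite of the same hotel, and a costs more."""
    return a[3] == "Std" and b[3] == "Suite" and a[1] == b[1] and a[2] > b[2]

def violates_fd(a, b):
    """Body of ic': same hotel and same type, but different price."""
    return a[1] == b[1] and a[3] == b[3] and a[2] != b[2]

def satisfies(instance, body):
    """A binary denial constraint holds iff no ordered pair of tuples matches its body."""
    return not any(body(a, b) for a, b in product(instance, repeat=2))

print(satisfies(room, violates_ic))  # False
print(satisfies(room, violates_fd))  # True
```

Iterating over ordered pairs (including a tuple paired with itself) mirrors the universal quantification over the two relation atoms of a binary denial constraint.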

2.2. Hypergraphs and hypertrees

A hypergraph is a pair $H = (N, E)$, where $N$ is a set of nodes, and $E$ a set of subsets of $N$, called the hyperedges of
$H$. The sets $N$ and $E$ will be also denoted as $N(H)$ and $E(H)$, respectively. Hypergraphs generalize graphs, as graphs
are hypergraphs whose hyperedges have exactly two elements (and are called edges). Examples of hypergraphs are
depicted in Figure 2.

Given a hypergraph $H = (N, E)$ and a pair of its nodes $n_1$, $n_2$, a path connecting $n_1$ and $n_2$ is a sequence $e_1, \dots,
e_m$ of distinct hyperedges of $H$ (with $m \geq 1$) such that $n_1 \in e_1$, $n_2 \in e_m$ and, for each $i \in [1..m-1]$, $e_i \cap e_{i+1} \neq \emptyset$. A
path connecting $n_1$ and $n_2$ is said to be trivial if $m = 1$, that is, if it consists of a single edge containing both nodes.

Let $\mathcal{R} = e_1, \dots, e_m$ be a sequence of hyperedges. We say that $e_i$ and $e_j$ are neighbors if $j = i + 1$, or $i = m$ and
$j = 1$ (or: if $i = j + 1$, or $i = 1$ and $j = m$). The sequence $\mathcal{R}$ is said to be a ring if: i) $m \geq 3$; ii) for each pair $e_i$, $e_j$
($i \neq j$), it holds that $e_i \cap e_j \neq \emptyset$ if and only if $e_i$ and $e_j$ are neighbors. An example of ring is depicted in Figure 2(b). It
is easy to see that the definition of ring collapses to the definition of cycle in the case that the hypergraph is a graph.
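
The ring conditions translate directly into a check over a sequence of hyperedges. The following sketch is a literal transcription of conditions i) and ii) above; the function name `is_ring` and the sample hypergraphs are illustrative assumptions, not taken from the paper's figures.

```python
def is_ring(edges):
    """Check the ring conditions for a sequence of hyperedges given as sets.

    i)  the sequence has at least three hyperedges;
    ii) two distinct hyperedges intersect if and only if they are neighbors,
        where the last hyperedge is also a neighbor of the first one.
    """
    m = len(edges)
    if m < 3:
        return False
    for i in range(m):
        for j in range(i + 1, m):
            neighbors = (j == i + 1) or (i == 0 and j == m - 1)
            if bool(edges[i] & edges[j]) != neighbors:
                return False
    return True

triangle = [{"a", "b"}, {"b", "c"}, {"c", "a"}]   # a cycle in an ordinary graph
chain = [{"a", "b"}, {"b", "c"}, {"c", "d"}]      # first and last edge do not meet
print(is_ring(triangle), is_ring(chain))  # True False
```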

The nodes appearing in a unique edge will be said to be ears of that edge. The set of ears of an edge e will be 
denoted as ears(e). For instance, in Figure|2a), ears{e\) = fa) and ears(e{) = 0. 

A set of nodes $N'$ of $H$ is said to be an edge-equivalent set if all the nodes in $N'$ appear together in the edges of
$H$. That is, for each $e \in E$ such that $e \cap N' \neq \emptyset$, it holds that $e \cap N' = N'$. Equivalently, the nodes in $N'$ are said to
be edge-equivalent. For instance, in the hypergraph of Figure 2(b), $\{t_1, t_2\}$ is an edge-equivalent set, as both $t_1$ and $t_2$
belong to the edges $e_1$, $e_2$ only. Analogously, in the hypergraph of Figure 2(c), nodes $t_2$ and $t_3$ are edge equivalent,
while $\{t_2, t_3, t_4\}$ is not an edge-equivalent set. Observe that sets of nodes which do not belong to any edge, as well as
the ears of an edge (which belong to one edge only), are particular cases of edge-equivalent sets.
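
Ears and edge-equivalent sets can be computed with a few set operations. The sketch below uses a hypothetical hypergraph `H` (it is not one of the hypergraphs of Figure 2), and the function names are illustrative.

```python
def ears(name, edges):
    """Nodes of edges[name] that appear in no other hyperedge."""
    return {n for n in edges[name]
            if all(n not in e for other, e in edges.items() if other != name)}

def is_edge_equivalent(nodes, edges):
    """True if every hyperedge that meets `nodes` contains all of `nodes`."""
    return all(nodes <= e for e in edges.values() if e & nodes)

# Hypothetical hypergraph, used only for illustration.
H = {"e1": {"t1", "t2", "t3"}, "e2": {"t1", "t2", "t4"}, "e3": {"t4", "t5"}}

print(ears("e1", H))                                # {'t3'}
print(is_edge_equivalent({"t1", "t2"}, H))          # True
print(is_edge_equivalent({"t1", "t2", "t3"}, H))    # False
print(is_edge_equivalent({"t6"}, H))                # True
```

The last call illustrates the remark above: a set of nodes that belongs to no edge is trivially edge-equivalent, since no hyperedge meets it.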

A hypergraph is said to be connected iff, for each pair of its nodes, a path connects them. A hypergraph $H$
is a hypertree iff it is connected and it satisfies the following acyclicity property: there is no pair of edges $e_1$, $e_2$
such that removing the nodes composing their intersection from every edge of $H$ results in a new hypergraph where
the remaining nodes of $e_1$ are still connected to the remaining nodes of $e_2$. An instance of hypertree is depicted
in Figure 2(c). Observe that the hypergraph in Figure 2(a) is not a hypertree, as the nodes $t_2$ and $t_6$ of $e_1$ and $e_2$,
respectively, are still connected (through the path $e_1$, $e_3$, $e_2$) even if we remove node $t_1$, which is shared by $e_1$ and $e_2$.
It is easy to see that hypertrees generalize trees. Basically, the acyclicity property of hypertrees used in this paper is
the well-known $\gamma$-acyclicity property introduced in [16]. In [15, 16], polynomial time algorithms for checking that a
hypergraph is $\gamma$-acyclic (and thus a hypertree) are provided.



3. PDBs under integrity constraints 

3.1. Probabilistic Databases (PDBs) 

A probabilistic relation schema is a classical relation schema with a distinguished attribute $P$, called probability,
whose domain is the real interval $[0, 1]$ and which functionally depends on the set of the other attributes. Hence,
a probabilistic relation schema has the form $R^p(A_1, \dots, A_n, P)$. A PDB schema $\mathcal{D}^p$ is a set of probabilistic relation
schemas. A probabilistic relation instance $r^p$ is an instance of $R^p$ and a PDB instance $D^p$ is an instance of $\mathcal{D}^p$. We
use the superscript $p$ to denote probabilistic relation and database schemas, and their instances. For a tuple $t \in D^p$,
the value $t[P]$ is the probability that $t$ belongs to the real world. We also denote $t[P]$ as $p(t)$.

Given a probabilistic relation schema $R^p$ (resp., relation instance $r^p$, probabilistic tuple $t$), we write $det(R^p)$ (resp.,
$det(r^p)$, $det(t)$) to denote its "deterministic" part. Hence, given $R^p(A_1, \dots, A_n, P)$, $det(R^p) = R(A_1, \dots, A_n)$,
$det(r^p) = \pi_{Attr(det(R^p))}(r^p)$, and $det(t) = \pi_{Attr(det(R^p))}(t)$. This definition is extended to deal with the deterministic part of
PDB schemas and instances in the obvious way.

3.1.1. Possible world semantics

The semantics of a PDB is based on possible worlds. Given a PDB $D^p$, a possible world is any subset of its
deterministic part $det(D^p)$. The set of possible worlds of $D^p$ is as follows: $pwd(D^p) = \{w \mid w \subseteq det(D^p)\}$. An
interpretation of $D^p$ is a probability distribution function (PDF) $Pr$ over the set of possible worlds $pwd(D^p)$ which
satisfies the following property:

$$(i)\quad \forall t \in D^p,\quad p(t) = \sum_{w \in pwd(D^p) \,\wedge\, det(t) \in w} Pr(w).$$

Condition (i) imposes that the probability of each tuple $t$ of $D^p$ coincides with that specified in $t$ itself. Observe
that, from the definition of PDF, $Pr$ must also satisfy the following conditions:

$$(ii)\quad \sum_{w \in pwd(D^p)} Pr(w) = 1; \qquad (iii)\quad \forall w \in pwd(D^p),\ Pr(w) \geq 0;$$

meaning that $Pr$ assigns a non-negative probability to each possible world, and that the probabilities assigned by $Pr$
to the possible worlds sum up to 1.

The set of interpretations of a PDB $D^p$ will be denoted as $I(D^p)$.

Observe that, strictly speaking, possible worlds are sets of deterministic counterparts of probabilistic tuples. However,
for the sake of simplicity, with a little abuse of notation, in the following we will say that a probabilistic tuple
$t$ belongs (resp., does not belong) to a possible world $w$, written $t \in w$ (resp., $t \notin w$), if $w$ contains (resp., does not
contain) the deterministic counterpart of $t$, i.e., $det(t) \in w$ (resp., $det(t) \notin w$). Moreover, given a deterministic tuple $t$,
we will write $p(t)$ to denote the probability associated with the probabilistic counterpart of $t$. Thus, $p(t)$ will denote
either $t[P]$, in the case that $t$ is a probabilistic tuple, or $t'[P]$, in the case that $t$ is deterministic and $t'$ is its probabilistic
counterpart.

If independence among tuples is assumed, only one interpretation of $D^p$ is considered, assigning to each possible
world $w$ the probability $Pr(w) = \prod_{t \in w} p(t) \times \prod_{t \notin w} (1 - p(t))$. In fact, under the independence assumption, the probability
of a conjunct of events is equal to the product of their probabilities. In turn, queries over the PDB are evaluated by
considering this unique interpretation. In this paper, we consider a different framework, where independence among
tuples is not assumed, and all the possible interpretations are considered.
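
As a concrete illustration of the definitions above, the following sketch enumerates the possible worlds of a small PDB, builds the interpretation induced by the independence assumption, and checks conditions (i)-(iii). The three tuples with probability 1/2 and all function names are assumptions made for the example.

```python
from fractions import Fraction as F
from itertools import combinations

def possible_worlds(tuples):
    """All subsets of the deterministic tuples, i.e. pwd(D^p)."""
    return [frozenset(c) for k in range(len(tuples) + 1)
            for c in combinations(tuples, k)]

def independent_interpretation(p):
    """Pr(w) = prod_{t in w} p(t) * prod_{t not in w} (1 - p(t))."""
    pr = {}
    for w in possible_worlds(sorted(p)):
        prob = F(1)
        for t, pt in p.items():
            prob *= pt if t in w else 1 - pt
        pr[w] = prob
    return pr

def is_interpretation(pr, p):
    """Check conditions (i)-(iii): marginals, total mass 1, non-negativity."""
    marginals_ok = all(sum(v for w, v in pr.items() if t in w) == pt
                       for t, pt in p.items())
    return marginals_ok and sum(pr.values()) == 1 and all(v >= 0 for v in pr.values())

p = {"t1": F(1, 2), "t2": F(1, 2), "t3": F(1, 2)}
pr1 = independent_interpretation(p)
print(len(pr1), pr1[frozenset({"t1", "t3"})], is_interpretation(pr1, p))  # 8 1/8 True
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the equality tests in conditions (i) and (ii) are not affected by floating-point rounding.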

Example 2. Consider the PDB schema $\mathcal{D}^p$ and its instance $D^p$ introduced in our motivating example. $D^p$ consists
of the relation instance $room^p$ reported in Figure 1. Assume that $t_1$, $t_2$, $t_3$ have probabilities $p_1 = p_2 = p_3 = 1/2$, and
disregard the integrity constraint defined in the motivating example.

Table 1 shows some interpretations of $D^p$. $Pr_1$ corresponds to the interpretation obtained by assuming tuple
independence. The interpretation parameterized by $\epsilon$, where $\epsilon$ is any real number in $[0, 1/4]$, suffices to show that there are infinitely
many interpretations of $D^p$. □

Table 1. Some interpretations of $D^p$



3.2. Imposing denial constraints over PDBs 

An integrity constraint over a PDB schema $\mathcal{D}^p$ is written as an integrity constraint over its deterministic part
$det(\mathcal{D}^p)$. Its impact on the semantics of the instances of $\mathcal{D}^p$ is as follows. As explained in the previous section, a
PDB $D^p$, instance of $\mathcal{D}^p$, may have several interpretations, all equally sound. However, if some constraints are known


S. Flesca, F. Furfaro, F. Parisi / submitted to Journal of Computer and System Sciences 00 (2013) 1-48




on its schema $\mathcal{D}^p$, some interpretations may have to be rejected. The interpretations to be discarded are those "in
contrast" with the domain knowledge expressed by the constraints, that is, those assigning a non-zero probability to
worlds violating some constraint.

Formally, given a set of constraints $IC$ on $\mathcal{D}^p$, an interpretation $Pr \in I(D^p)$ is admissible (and said to be a model
for $D^p$ w.r.t. $IC$) if $\sum_{w \in pwd(D^p) \,\wedge\, w \models IC} Pr(w) = 1$ (or, equivalently, if $\sum_{w \in pwd(D^p) \,\wedge\, w \not\models IC} Pr(w) = 0$). The set of models of
$D^p$ w.r.t. $IC$ will be denoted as $M(D^p, IC)$. Obviously, $M(D^p, IC)$ coincides with the set of interpretations $I(D^p)$ if
no integrity constraint is imposed ($IC = \emptyset$), while, in general, $M(D^p, IC) \subseteq I(D^p)$.

Example 3. Consider the PDB $D^p$ and the integrity constraint $ic$ introduced in our motivating example. Assume that
all the tuples of $room^p$ have probability 1/2. Thus, the interpretations for $D^p$ are those discussed in Example 2 (see
also Table 1). It is easy to see that $room^p$ admits at least one model, namely the interpretation (shown in Table 1) which assigns
non-zero probability only to $w_1 = \{t_2\}$ and $w_2 = \{t_1, t_3\}$. In fact, it can be proved that this interpretation is the unique model of
$room^p$ w.r.t. $ic$, since every other interpretation of $room^p$, including $Pr_1$ where tuple independence is assumed, makes
the constraint $ic$ violated in some non-zero probability world. This example shows an interesting aspect of denial
constraints. Although denial constraints only explicitly forbid the co-existence of tuples, they may implicitly entail the
co-existence of tuples: for instance, for the given probabilities of $t_1$, $t_2$, $t_3$, constraint $ic$ implies the coexistence of $t_1$
and $t_3$. □
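
The admissibility condition can be checked mechanically on the two interpretations mentioned in Example 3. The sketch below is illustrative: it hard-codes the two pairs of tuples that $ic$ forbids, represents an interpretation as a dictionary from worlds to probabilities (worlds not listed have probability 0), and evaluates the query $q$ of the motivating example as "both $t_1$ and $t_3$ belong to the world". All names are hypothetical.

```python
from fractions import Fraction as F

CONFLICTS = [{"t1", "t2"}, {"t2", "t3"}]  # tuple pairs forbidden by ic

def violates(world):
    """A world violates IC if it contains all the tuples of some conflicting set."""
    return any(c <= world for c in CONFLICTS)

def is_model(pr):
    """Admissible iff the worlds violating IC carry total probability 0."""
    return sum(v for w, v in pr.items() if violates(w)) == 0

def query_probability(pr):
    """Probability that q holds, i.e. that both t1 and t3 are in the world."""
    return sum(v for w, v in pr.items() if {"t1", "t3"} <= w)

# Interpretation with non-zero probability only on {t2} and {t1, t3}.
pr_model = {frozenset({"t2"}): F(1, 2), frozenset({"t1", "t3"}): F(1, 2)}
# Tuple independence with all probabilities 1/2: each of the 8 worlds gets 1/8.
pr_indep = {frozenset(s): F(1, 8) for s in
            [(), ("t1",), ("t2",), ("t3",), ("t1", "t2"),
             ("t1", "t3"), ("t2", "t3"), ("t1", "t2", "t3")]}

print(is_model(pr_model), query_probability(pr_model))  # True 1/2
print(is_model(pr_indep), query_probability(pr_indep))  # False 1/4
```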

Example 3 re-examines Case 2 of our motivating example, and shows a case where the PDB is consistent and
admits a unique model. The reader is referred to the discussions of Case 1 and Case 3 of the motivating example to
consider different scenarios, where the PDB is not consistent (Case 1), or is consistent and admits several models (Case 3).

3.2.1. Modeling denial constraints as hypergraphs 

Basically, a denial constraint over a PDB restricts its models w.r.t. the set of interpretations, as it expresses the fact
that some sets of tuples of $D^p$ are conflicting, that is, they cannot co-exist: an interpretation is not a model if it assigns
a non-zero probability to a possible world containing all these tuples together. Hence, a set of denial constraints $IC$ can
be naturally represented as a conflict hypergraph, whose nodes are the tuples of $D^p$ and where each hyperedge consists
of a set of tuples whose co-existence is forbidden by a denial constraint in $IC$ (in fact, hypergraphs were used to model
denial constraints also in several works dealing with consistent query answers in the deterministic setting [jit). The 
definitions of conflicting tuples and conflict hypergraph are as follows. 

Definition 1 (Conflicting set of tuples). Let $\mathcal{D}^p$ be a PDB schema, $IC$ a set of denial constraints on $\mathcal{D}^p$, and $D^p$ an
instance of $\mathcal{D}^p$. A set $T$ of tuples of $D^p$ is said to be a conflicting set w.r.t. $IC$ if it is a minimal set such that any
possible world containing all the tuples in $T$ violates $IC$.

Example 4. In Example 3, both $\{t_1, t_2\}$ and $\{t_2, t_3\}$ are conflicting sets of tuples w.r.t. $IC = \{ic\}$, while $\{t_1, t_2, t_3\}$ is not,
as it is not minimal. □

Definition 2 (Conflict hypergraph). Let $\mathcal{D}^p$ be a PDB schema, $IC$ a set of denial constraints on $\mathcal{D}^p$, and $D^p$ an
instance of $\mathcal{D}^p$. The conflict hypergraph of $D^p$ w.r.t. $IC$ is the hypergraph $HG(D^p, IC)$ whose nodes are the tuples of
$D^p$ and whose hyperedges are the conflicting sets of $D^p$ w.r.t. $IC$.

Example 5. Consider a database instance D p having tuples t\ , . . . , tg, and a set of denial constraints IC stating 
that e\ — \t\, ?4, tf\, e2 — {t\, ?2, h, t$, t$, t^}, e^ — {h, ?6, tg), and e^ — {?6, t%} are conflicting sets of tuples. The conflict 
hypergraph HG(D P , IC) in Figure\3\concisely represents this fact. □ 

It is easy to see that, if IC contains binary denial constraints only, then the conflict hypergraph collapses to a graph. 

Example 6. Consider $D^p$ and $IC = \{ic\}$ of our motivating example (observe that $ic$ is a binary denial constraint).
The graph representing $HG(D^p, IC)$ is shown in Figure 4. □
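
For binary denial constraints, the conflict graph can be built by testing every unordered pair of tuples against the constraint body. The following sketch does this for $ic$; the tuple values are hypothetical placeholders consistent with the motivating example (only the ordering of the prices matters), and the sketch assumes that no single tuple violates the constraint on its own, so that every conflicting pair is minimal.

```python
from itertools import combinations

# Hypothetical tuples (Id, Hid, Price, Type); values are placeholders.
tuples = {"t1": (1, 1, 120, "Std"), "t2": (2, 1, 70, "Suite"), "t3": (3, 1, 120, "Std")}

def in_conflict(a, b):
    """True if a and b (in either order) match the body of ic."""
    def body(s, u):
        return s[3] == "Std" and u[3] == "Suite" and s[1] == u[1] and s[2] > u[2]
    return body(a, b) or body(b, a)

def conflict_graph(db):
    """Edges of the conflict graph: unordered pairs of tuples that cannot coexist."""
    return {frozenset({x, y}) for x, y in combinations(sorted(db), 2)
            if in_conflict(db[x], db[y])}

edges = conflict_graph(tuples)
print(sorted(sorted(e) for e in edges))  # [['t1', 't2'], ['t2', 't3']]
```

The loop over pairs is quadratic in the number of tuples, in line with the polynomial bound on the construction of the conflict hypergraph for a fixed set of constraints.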

It is easy to see that the size of the conflict hypergraph is polynomial w.r.t. the size of $D^p$ (in particular, its number
of nodes is bounded by the number of tuples of $D^p$) and that it can be constructed in polynomial time w.r.t. the size of $D^p$.

Remark 1. Observe that the conflict hypergraph $HG(D^p, IC)$ corresponds to a representation of the dual lineage of the
constraint query $q_{IC}$, i.e., the boolean query $q_{IC} = \bigvee_{ic \in IC} (\neg ic)$ which basically asks whether there is no model for $D^p$
w.r.t. $IC$. For instance, consider the case of Example 3. A lineage of $q_{IC}$ is the DNF expression $(X_1 \wedge X_2) \vee (X_2 \wedge X_3)$,
where each $X_i$ corresponds to tuple $t_i$. Thus, the semantics of the considered constraints is captured by the dual lineage,
that is the CNF expression $(Y_1 \vee Y_2) \wedge (Y_2 \vee Y_3)$, where each $Y_i = not(X_i)$. It is easy to see that the conflict hypergraph
(as described in Example 6) is the hypergraph of this CNF expression. In the conclusions (Section 8), we will
elaborate more on this relationship between conflict hypergraphs and (dual) lineages of constraint queries: exploiting
this relationship may help to tackle the problems addressed in this paper from a different perspective.

Figure 4. Conflict graph of the motivating example: $t_1 - t_2 - t_3$

4. Consistency checking 

Detecting inconsistencies is fundamental for certifying the quality of the data and extracting reliable information
from them. In the deterministic setting, inconsistency typically arises from errors that occurred during the generation
of the data, as well as during their acquisition. In the probabilistic setting, there is one more possible source of
inconsistency, coming from the technique adopted for estimating the "degree of uncertainty" of the acquired information,
which determines the probability values assigned to the probabilistic tuples. Bad assignments of probability
values may come to light when integrity constraints on the data domain (which typically encode certain information coming
from well-established knowledge of the domain) are considered.

In this section, we address the problem of checking this form of consistency, that is, the problem of checking 
whether the probabilities associated with the tuples are "compatible" with the integrity constraints defined over the 
data. It is worth noting that the study of this problem has a strong impact in several aspects of the management of 
probabilistic data: checking the consistency can be used during the data acquisition phase (in order to "certify" the 
validity of the model applied for determining the probabilities of the tuples), as well as a preliminary step of the 
computation of the query answers. Moreover, it is strongly interleaved with the problem of repairing the data, whose 
study is deferred to future work. 

Before providing the formal definition of the consistency checking problem, we introduce some basic notions and
notations. Given a PDB schema $\mathcal{D}^p$, a set of integrity constraints $IC$, and an instance $D^p$ of $\mathcal{D}^p$, we say that $D^p$
satisfies (resp., does not satisfy) $IC$, denoted as $D^p \models IC$ (resp., $D^p \not\models IC$), iff the set of models $M(D^p, IC)$ is not
empty (resp., is empty). In the following, we will say "consistent w.r.t." (resp., "inconsistent w.r.t.") meaning the same as "satisfies"
(resp., "does not satisfy").

We are now ready to provide the formal definition of the consistency checking problem. In this definition, as well
as in the rest of the paper, we assume that a PDB schema $\mathcal{D}^p$ and a set of denial constraints $IC$ over $\mathcal{D}^p$ are given.

Definition 3 (Consistency Checking Problem (cc)). Given a PDB instance $D^p$ of $\mathcal{D}^p$, the consistency checking
problem (cc) is deciding whether $D^p \models IC$.

We point out that, in our complexity analysis, $\mathcal{D}^p$ and $IC$ will be assumed of fixed size; thus, we refer to data
complexity.

The following theorem states that cc is NP-complete, and it easily derives from the interconnection of cc with
the NP-complete problem PSAT [22] (Probabilistic satisfiability), which is the generalization of SAT defined as
follows: "Let $S = \{C_1, \dots, C_m\}$ be a set of $m$ clauses, where each $C_i$ is a disjunction of literals (i.e., possibly negated
propositional variables $x_1, \dots, x_n$) and each $C_i$ is associated with a probability $p_i$. Decide whether $S$ is satisfiable,
that is, whether there is a probability distribution $\pi$ over all the $2^n$ possible truth assignments over $x_1, \dots, x_n$ such
that, for each $C_i$, the sum of the probabilities assigned by $\pi$ to the truth assignments satisfying $C_i$ is equal to $p_i$."
Basically, the membership in NP of cc derives from the fact that any instance of cc over a PDB $D^p$ can be reduced to
an equivalent PSAT instance where: a) the propositional variables correspond to the tuples of $D^p$, b) the constraints
of cc are encoded into clauses with probability 1, c) the fact that the tuples are assigned a probability is encoded
into a clause for each tuple, with probability equal to the tuple probability. As regards the hardness of cc for NP, it
intuitively derives from the fact that the hardness of PSAT was shown in [22] for the case that only unary clauses have
probabilities different from 1: thus, this proof can be applied on cc, by mapping unary clauses to tuples and the other
clauses (which are deterministic) to constraints.¹

Figure 3. A conflict hypergraph.

Theorem 1 (Complexity of cc). cc is NP-complete. 

In the following, we devote our attention to determining tractable cases of cc, from two different perspectives.
First, in Section 4.1, we will show tractable cases which depend on the structural properties of the conflict hypergraph,
and, thus, on how the data combine with the constraints. The major results of this section are that cc is
tractable if the conflict hypergraph is either a hypertree or a ring. Then, in Section 4.2, we will show syntactic conditions
on the constraints which make cc tractable, independently of the shape of the conflict hypergraph. At the end
of the latter section, we also discuss the relationship between these two kinds of tractable cases.

4.1. Tractability arising from the structure of the conflict hypergraph 

It is worth noting that, since there is a polynomial-time reduction from cc to PSAT, the tractability results for
PSAT may be exploited for devising efficient strategies for solving cc. In fact, in [22], it was shown that 2PSAT (where
clauses are binary) can be solved in polynomial time if the graph of clauses (which contains a node for each literal and
an edge for each pair of literals occurring in the same clause) is outerplanar. This result relies on a suitable reduction
of 2PSAT to a tractable instance of 2MAXSAT (maximum weight satisfiability with at most two literals per clause).
Since, in the case of binary denial constraints, the conflict hypergraph is a graph and the above-discussed reduction
of cc to PSAT results in an instance of 2PSAT where the graph of clauses has the same "shape" as our conflict graph,
we have that cc is polynomial-time solvable if denial constraints are binary and the conflict graph is outerplanar.
However, on the whole, reducing 2PSAT to 2MAXSAT and then solving the obtained 2MAXSAT instance requires a
high polynomial-degree computation (specifically, the complexity is $O(n^6 \log n)$, where $n$ is the number of literals in
the PSAT formula, corresponding to the number of tuples in our case).

Here, we detect tractable cases of cc which, to the best of our knowledge, are not subsumed by any known tractability
result for PSAT. Our tractable cases have the following advantages:

- no limitation is put on the arity of the constraints; 

- instead of exploiting reductions of cc to other problems, we determine necessary and sufficient conditions which can 
be efficiently checked (in linear time) by only examining the conflict hypergraph and the probabilities of the tuples. 

Our main results regarding the tractability arising from the structure of the conflict hypergraph (which will be
given in Sections 4.1.2 and 4.1.3) are that consistency can be checked in linear time over the conflict hypergraph if it
is either a hypertree or a ring.

4.1.1. New notations and preliminary results 

Before providing our characterization of tractable cases arising from the structure of the conflict hypergraph, we
introduce some preliminary results and new notations. Given a hypergraph $H = (N, E)$ and a hyperedge $e \in E$, the
set of intersections of $e$ with the other hyperedges of $H$ is denoted as $Int(e, H) = \{ s \mid \exists e' \in E \text{ s.t. } e' \neq e \wedge s = e \cap e' \}$.
For instance, for the hypertree $H$ in Figure 2(c), $Int(e_1, H) = \{\{t_2, t_3\}, \{t_2, t_3, t_4\}\}$. Moreover, given a set of sets $S$, we
call $S$ a matryoshka if there is a total ordering $s_1, \dots, s_n$ of its elements such that, for each $i, j \in [1..n]$ with $i < j$, it
holds that $s_i \subset s_j$, that is, $s_1 \subset s_2 \subset \cdots \subset s_n$. For instance, the above-mentioned set $Int(e_1, H)$ is a matryoshka. Finally, given a set
of hyperedges $S$, we denote as $H^{-S}$ the hypergraph obtained from $H$ by removing the edges of $S$ and the nodes in
the edges of $S$ which do not belong to any other edge of the remaining part of $H$. That is, $H^{-S} = (N', E')$, where
$E' = E \setminus S$ and $N' = \bigcup_{e \in E'} e$. For instance, for the hypergraph $H$ in Figure 2(a), $H^{-\{e_1\}}$ is obtained by removing $e_1$ from
the set of edges of $H$, and $t_2$ from the set of its nodes. Analogously, $H^{-\{e_1, e_2\}}$ will not contain edges $e_1$ and $e_2$, as well
as nodes $t_1$, $t_2$, $t_6$.



¹ However, we will not provide a formal proof of the NP-hardness of cc based on this reasoning, that is, based on reducing hard instances of
PSAT to cc instances. Indeed, a formal proof of the hardness will be provided for Theorems 5 and 7, introduced in Section 4.2, which are more
specific in stating the hardness of cc in that they say that cc is NP-hard in the presence of denial constraints of some syntactic forms.
The first preliminary result (Proposition 1) states a general property of hypertrees: any hypertree $H$ contains at
least one edge $e$ which is attached to the rest of $H$ so that the set of intersections of $e$ with the other edges of $H$ is
a matryoshka. Moreover, removing this edge from $H$ results in a new hypergraph which is still a hypertree. This
result is of independent interest, as it allows for reasoning on hypertrees (conforming to the $\gamma$-acyclicity property) by
using induction on the number of hyperedges: any hypertree with $x$ edges can be viewed as a hypertree with $x - 1$
edges which has been augmented with a new edge, attached to the rest of the hypertree by means of sets of nodes
encapsulated one into another.

Proposition 1. Let $H = (N, E)$ be a hypertree. Then, there is at least one hyperedge $e \in E$ such that $Int(e, H)$ is a
matryoshka. Moreover, $H^{-\{e\}}$ is still a hypertree.

As an example, consider the hypertree in Figure 2(c). As ensured by Proposition 1, this hypertree contains the edge
$e_1$ whose set of intersections with the other edges is $\{\{t_2, t_3\}, \{t_2, t_3, t_4\}\}$, which is a matryoshka. Moreover, removing $e_1$
from the set of hyperedges, and the ears of $e_1$ from the set of nodes, still yields a hypertree. The same holds for $e_2$ and
$e_4$, but not for $e_3$.

The second preliminary result (which will be stated as Lemma 1) regards the minimum probability that a set of
tuples co-exist according to the models of the given PDB. Specifically, given a set of tuples $T$ of the PDB $D^p$, we
denote this minimum probability as $p^{min}(T)$, whose formal definition is as follows:

$$p^{min}(T) = \min_{M \in \mathcal{M}(D^p, IC)} \sum_{w \supseteq T} M(w)$$

The following example clarifies the semantics of $p^{min}$.

Example 7. Consider the case discussed in Example 2 (the same as Case 2 of our motivating example, but with
$IC = \emptyset$). Here, every interpretation is a model. Hence, $p^{min}(t_1, t_3) = 0$, as there is an interpretation (for instance, $Pr_3$
or $Pr_4$ in Table 1) which assigns probability 0 to both the possible worlds $\{t_1, t_3\}$ and $\{t_1, t_2, t_3\}$, i.e., the worlds containing
both $t_1$ and $t_3$. However, if we impose $IC = \{ic\}$ of the motivating example, we have that $p^{min}(t_1, t_3) = 1/2$, as
according to the unique model for the database w.r.t. $IC$, the probabilities of worlds $\{t_1, t_3\}$ and $\{t_1, t_2, t_3\}$ are,
respectively, 1/2 and 0 (hence, their sum is 1/2). □

Lemma 1 states that, for any set of tuples $T = \{t_1, \dots, t_n\}$, independently of how they are connected in the
conflict hypergraph, the probability that they co-exist, for every model, has a lower bound which is implied by their
marginal probabilities. This lower bound is $\max\{0, \sum_{i=1}^{n} p(t_i) - n + 1\}$, which is exactly the minimum probability of
the co-existence of $t_1, \dots, t_n$ in two cases: i) the case that $t_1, \dots, t_n$ are pairwise disconnected in the conflict hypergraph
(which happens, for instance, in the very special case that $t_1, \dots, t_n$ are not involved in any constraint); ii) the case
that the set of intersections of $T$ with the edges of $H$ is a matryoshka. This is interesting, as it depicts a case of
tuples correlated through constraints which behave similarly to tuples among which no correlation is expressed by
any constraint.

Lemma 1. Let $D^p$ be an instance of $\mathcal{D}^p$ consistent w.r.t. $IC$, $T$ a set of tuples of $D^p$, and let $H$ denote the conflict
hypergraph $HG(D^p, IC)$. If either i) the tuples in $T$ are pairwise disconnected in $H$, or ii) $Int(T, H)$ is a matryoshka,
then $p^{min}(T) = \max\{0, \sum_{t \in T} p(t) - |T| + 1\}$. Otherwise, this formula provides a lower bound for $p^{min}(T)$.
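
As a quick illustration of the bound in Lemma 1, the following minimal Python sketch evaluates $\max\{0, \sum_{t \in T} p(t) - |T| + 1\}$ for a list of marginal probabilities. The function name `coexistence_lower_bound` and the sample values are assumptions introduced here for illustration only; exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction

def coexistence_lower_bound(probs):
    """Return max{0, sum(probs) - n + 1}, where n = len(probs)."""
    n = len(probs)
    return max(Fraction(0), sum(probs, Fraction(0)) - n + 1)

# Hypothetical marginals, not taken from the paper's figures.
half = Fraction(1, 2)
bound_two = coexistence_lower_bound([Fraction(3, 4), Fraction(3, 4)])  # 3/2 - 2 + 1
bound_three = coexistence_lower_bound([half, half, half])              # 3/2 - 3 + 1 is negative
```

In the first case two tuples with marginal 3/4 must co-exist with probability at least 1/2 in every model; in the second case the bound degenerates to 0.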

4.1.2. Tractability of hypertrees 

We are now ready to state our first result on cc tractability. 



Theorem 2. Given an instance $D^p$ of $\mathcal{D}^p$, if $HG(D^p, IC)$ is a hypertree, then $D^p \models IC$ iff, for each hyperedge $e$ of
$HG(D^p, IC)$, it holds that

$$\sum_{t \in e} p(t) \le |e| - 1 \qquad (1)$$

Proof. ($\Rightarrow$): We first show that if there is a model for $D^p$ w.r.t. $IC$, then inequality (1) holds for each hyperedge
of $HG(D^p, IC)$. Reasoning by contradiction, assume that $D^p \models IC$ and there is a hyperedge $e = \{t_1, \dots, t_n\}$ of
$HG(D^p, IC)$ such that $\sum_{i=1}^{n} p(t_i) - n + 1 > 0$. Since this value is a lower bound for $p^{min}(t_1, \dots, t_n)$ (due to Lemma 1),
it holds that every model $M$ for $D^p$ w.r.t. $IC$ assigns a non-zero probability to some possible world containing all the
tuples $t_1, \dots, t_n$. This contradicts that $M$ is a model, since any possible world containing $t_1, \dots, t_n$ does not satisfy $IC$.
($\Leftarrow$): We now prove that if inequality (1) holds for each hyperedge of $HG(D^p, IC)$, then there is a model for $D^p$ w.r.t.
$IC$. We reason by induction on the number of hyperedges of $HG(D^p, IC)$.

The base case is when $HG(D^p, IC)$ consists of a single hyperedge $e = \{t_1, \dots, t_k\}$. Consider the same database
$D^p$, but impose over it the empty set of denial constraints, instead of $IC$. Then, from Lemma 1 (case i), we
have that there is at least one model $M$ for $D^p$ (w.r.t. the empty set of constraints) such that
$\sum_{w \supseteq \{t_1, \dots, t_k\}} M(w) = \max\{0, \sum_{i=1}^{k} p(t_i) - k + 1\}$. The term on the right-hand side evaluates to 0, as, from the hypothesis, we have that
$\sum_{i=1}^{k} p(t_i) \le k - 1$. Hence, $M$ is a model for $D^p$ also w.r.t. $IC$, since the only constraint entailed by $IC$ is that the
tuples $t_1, \dots, t_k$ cannot be altogether in any possible world with non-zero probability.

We now prove the induction step. Consider the case that $H = HG(D^p, IC)$ is a hypertree with $n$ hyperedges. The
induction hypothesis is that the property to be shown holds in the presence of any conflict hypergraph consisting of a
hypertree with $n - 1$ hyperedges. Let $e$ be a hyperedge of $H$ such that $Int(e, H)$ is a matryoshka, and $H' = H^{-\{e\}}$ is a
hypertree. The existence of $e$ and the fact that $H'$ is a hypertree are guaranteed by Proposition 1. We denote the nodes
in $e$ as $t'_1, \dots, t'_m, t''_1, \dots, t''_n$, where $T' = \{t'_1, \dots, t'_m\}$ is the set of nodes of $e$ in $H'$, and $T'' = \{t''_1, \dots, t''_n\}$ are the ears of $e$.
Correspondingly, $D''$ is the portion of $D^p$ containing only the tuples $t''_1, \dots, t''_n$, and $D'$ is the portion of $D^p$ containing
all the other tuples (that is, the tuples corresponding to the nodes of $H'$). We consider $D'$ associated with the set of
constraints imposed by $H'$, and $D''$ associated with an empty set of constraints.

Thanks to the induction hypothesis, and to the fact that inequality (1) holds, we have that $D'$ is consistent w.r.t.
the set of constraints encoded by $H'$. Moreover, since $Int(e, H)$ is a matryoshka, we have that the set $T'$ is such that
$Int(T', H')$ is a matryoshka too. Hence, from Lemma 1 (case ii), we have that there is a model $M'$ for $D'$ w.r.t. $H'$ such
that $\sum_{w \supseteq \{t'_1, \dots, t'_m\}} M'(w) = \max\{0, \sum_{i=1}^{m} p(t'_i) - m + 1\}$. We denote this value as $p'$, and consider the case that $p' > 0$
(that is, $p' = \sum_{i=1}^{m} p(t'_i) - m + 1$), as the case that $p' = 0$ can be proved analogously. Since inequality (1) holds for every
edge of $HG(D^p, IC)$, the following inequality holds for the tuples of $e$: $\sum_{i=1..m} p(t'_i) + \sum_{i=1..n} p(t''_i) - m - n + 1 \le 0$.
The quantity $m - \sum_{i=1..m} p(t'_i)$ is equal to $1 - p'$, that is, the overall probability assigned by $M'$ to the possible worlds
of $D'$ not containing at least one tuple among $t'_1, \dots, t'_m$. Denoting the probability $1 - p'$ as $\bar{p}'$, the above inequality becomes
$\sum_{i=1..n} p(t''_i) - n + 1 \le \bar{p}'$. Owing to Lemma 1 (case i), the term on the left-hand side corresponds to $p^{min}(t''_1, \dots, t''_n)$.

Intuitively enough, this suffices to end the proof, as it means that, if we arrange the tuples $t''_1, \dots, t''_n$ according to
a model $M''$ for $D''$ which minimizes the overall probability of the possible worlds of $D''$ containing $t''_1, \dots, t''_n$ altogether,
the portion of the probability space invested to represent these worlds is not greater than the portion of the probability
space invested by $M'$ to represent the possible worlds of $D'$ not containing at least one tuple among $t'_1, \dots, t'_m$. For the
sake of completeness, we formally show how to obtain a model for $D^p$ w.r.t. $IC$ starting from $M'$ and $M''$.

First of all, observe that any interpretation $Pr$ can be represented as a sequence $S(Pr) = (w_1, p_1), \dots, (w_k, p_k)$
where:

• $w_1, \dots, w_k$ are all the possible worlds such that $Pr(w_i) \neq 0$ for each $i \in [1..k]$;

• $p_1 = Pr(w_1)$;

• for each $i \in [2..k]$, $p_i = p_{i-1} + Pr(w_i)$ (that is, $p_i$ is the cumulative probability of all the possible worlds in $S(Pr)$
occurring in the positions not greater than $i$). In particular, this entails that $p_k = 1$.

It is easy to see that many sequences can represent the same interpretation Pr, each corresponding to a different 
permutation of the set of the possible worlds which are assigned a non-zero probability by Pr. 

Consider the model $M'$, and let $\alpha$ be the number of possible worlds which are assigned by $M'$ a non-zero probability
and which do not contain at least one tuple among $t'_1, \dots, t'_m$. Then, take a sequence $S(M')$ such that the first $\alpha$
pairs are possible worlds not containing at least one tuple among $t'_1, \dots, t'_m$. In this sequence, denoting the generic pair
occurring in it as $(w'_j, p'_j)$, it holds that $p'_\alpha = \bar{p}' = 1 - p'$.

Analogously, consider the model $M''$, and take any sequence $S(M'')$ where the first pair contains the possible
world containing all the tuples $t''_1, \dots, t''_n$. Obviously, denoting the generic pair occurring in $S(M'')$ as $(w''_j, p''_j)$, it
holds that $p''_1 = p^{min}(t''_1, \dots, t''_n)$ is less than or equal to $\bar{p}'$.

Now consider the sequence $S' = (w'''_1, p'''_1), \dots, (w'''_k, p'''_k)$ defined as follows:

• $p'''_1, \dots, p'''_k$ are the distinct (cumulative) probability values occurring in $S(M')$ and $S(M'')$, ordered by their
values;

• for each $i \in [1..k]$, $w'''_i = w' \cup w''$, where $w'$ (resp., $w''$) is the possible world occurring in the left-most pair of
$S(M')$ (resp., $S(M'')$) containing a (cumulative) probability value not less than $p'''_i$.

Consider the function $f$ over the set of possible worlds of $D^p$ defined as follows:

$$f(w) = \begin{cases} 0 & \text{if } w \text{ does not occur in any pair of } S' \\ p'''_1 & \text{if } w \text{ occurs in the first pair of } S' \\ p'''_i - p'''_{i-1} & \text{if } w \text{ occurs in the } i\text{-th pair of } S' \ (i > 1) \end{cases}$$

It is easy to see that $f$ is an interpretation for $D^p$. In fact, by construction, it assigns to each possible world of
$D^p$ a value in [0,1], and the sum of the values assigned to the possible worlds is 1. Moreover, the values assigned
by $f$ to the possible worlds are compatible with the marginal probabilities of the tuples, since, for each tuple $t$ of $D'$,
$\sum_{w \mid t \in w} f(w) = \sum_{w' \mid t \in w'} M'(w') = p(t)$, as well as, for each tuple $t$ of $D''$, $\sum_{w \mid t \in w} f(w) = \sum_{w'' \mid t \in w''} M''(w'') = p(t)$.

In particular, $f$ is also a model for $D^p$ w.r.t. $IC$: on the one hand, $f$ assigns 0 to every possible world containing
tuples which are conflicting according to $H'$ (this follows from how $f$ was obtained starting from $M'$). Moreover, $f$ assigns
0 to every possible world containing tuples which are conflicting according to the hyperedge $e$. In fact, the worlds
containing all the tuples $t'_1, \dots, t'_m, t''_1, \dots, t''_n$ are assigned 0 by $f$, since the worlds occurring in $S'$ containing $t''_1, \dots, t''_n$
do not contain at least one tuple among $t'_1, \dots, t'_m$ (this trivially follows from the fact that $\bar{p}' = 1 - p' \ge p^{min}(t''_1, \dots, t''_n)$). The
fact that $f$ is a model for $D^p$ w.r.t. $IC$ means that $D^p \models IC$. □
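
The merging step used in the proof can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: each model is given as an ordered list of (world, probability) pairs with plain (non-cumulative) probabilities summing to 1, worlds are frozensets of tuple names, and the function name `merge_models` as well as the tiny instance are hypothetical. Walking through the two lists while always advancing to the nearest cumulative breakpoint reproduces the sequence of merged worlds and the differences between consecutive cumulative values.

```python
from fractions import Fraction

def merge_models(seq1, seq2):
    """Couple two ordered distributions by merging their cumulative breakpoints."""
    joint = {}
    i = j = 0
    rem1, rem2 = seq1[0][1], seq2[0][1]   # mass of the current worlds not yet assigned
    while i < len(seq1) and j < len(seq2):
        step = min(rem1, rem2)            # distance to the next cumulative breakpoint
        world = seq1[i][0] | seq2[j][0]   # union of the two current worlds
        joint[world] = joint.get(world, Fraction(0)) + step
        rem1 -= step
        rem2 -= step
        if rem1 == 0:
            i += 1
            rem1 = seq1[i][1] if i < len(seq1) else Fraction(0)
        if rem2 == 0:
            j += 1
            rem2 = seq2[j][1] if j < len(seq2) else Fraction(0)
    return joint

half = Fraction(1, 2)
# Hypothetical instance: e = {a, b, c}, T' = {a, b}, T'' = {c}, p(a) = p(c) = 1/2, p(b) = 1.
seq_m1 = [(frozenset({"b"}), half), (frozenset({"a", "b"}), half)]  # worlds missing a tuple of T' first
seq_m2 = [(frozenset({"c"}), half), (frozenset(), half)]            # world containing all of T'' first
joint = merge_models(seq_m1, seq_m2)
```

In this toy instance the merged distribution puts probability 1/2 on each of the worlds {b, c} and {a, b}: the marginals are preserved and no world contains all of a, b, c.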

The above theorem entails that, if $HG(D^p, IC)$ is a hypertree, then cc can be decided in time $O(|E| \cdot k)$ over $HG(D^p, IC)$,
where $E$ is the set of hyperedges of $HG(D^p, IC)$ and $k$ is the maximum arity of the constraints (which bounds
the number of nodes in each hyperedge). The number of hyperedges in a hypertree is bounded by the number of nodes
$|N|$ (this easily follows from Proposition 1), thus $O(|E| \cdot k) = O(|N| \cdot k)$. Interestingly, even if denial constraints of any
arity were allowed, the consistency check could still be accomplished over the conflict hypertree in polynomial time
(that is, replacing $k$ with $|N|$, we would get the bound $O(|N|^2)$).
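
The per-hyperedge test behind this bound can be sketched as a single pass over the hyperedges. The snippet below is a minimal illustration, assuming the conflict hypergraph is already known to be a hypertree (the function does not verify this) and is given as a list of sets of tuple names; the name `satisfies_edge_bounds` and the probabilities are hypothetical.

```python
from fractions import Fraction

def satisfies_edge_bounds(prob, hyperedges):
    """True iff sum_{t in e} p(t) <= |e| - 1 holds for every hyperedge e."""
    return all(sum(prob[t] for t in e) <= len(e) - 1 for e in hyperedges)

# Hypothetical hypertree with two hyperedges sharing node "c".
prob = {"a": Fraction(1, 2), "b": Fraction(3, 4), "c": Fraction(1, 2), "d": Fraction(1, 2)}
hyperedges = [{"a", "b", "c"}, {"c", "d"}]
consistent = satisfies_edge_bounds(prob, hyperedges)
```

Each hyperedge is visited once and its at most $k$ probabilities are summed, which matches the $O(|E| \cdot k)$ cost stated above.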

Example 8. Consider the PDB schema $\mathcal{D}^p$ consisting of relation scheme $Person^p$(Name, Age, Parent, Date, City, P)
representing some personal data obtained by integrating various sources. A tuple over $Person^p$ refers to a person,
and, in particular, attribute Parent references the name of one of the parents of the person, while City is the city of
residence of the person on the date specified in Date. Consider the PDB instance $D^p$ consisting of the instance $person^p$
of $Person^p$ shown in Figure 5(a).




Figure 5. (a) PDB instance $D^p$; (b) Conflict hypergraph $HG(D^p, IC)$



Assume that $IC$ consists of the following constraints defined over $Person^p$:

$ic_1$: $\neg[\, Person(x_1, y_1, z_1, v_1, w_1) \wedge Person(x_1, y_2, z_2, v_2, w_2) \wedge Person(x_1, y_3, z_3, v_3, w_3) \wedge z_1 \neq z_2 \wedge z_1 \neq z_3 \wedge z_2 \neq z_3 \,]$,
imposing that no person has more than 2 parents;

$ic_2$: $\neg[\, Person(x_1, y_1, z_1, v_1, w_1) \wedge Person(z_1, y_2, z_2, v_2, w_2) \wedge y_1 > y_2 \,]$, imposing that no person is older than any of
her parents.
The conflict hypergraph $HG(D^p, IC)$ is shown in Figure 5(b). Here, the conflicting sets $e_1$, $e_2$ are originated
by violations of $ic_1$, while $e_3$ is originated by the violation of $ic_2$. It is easy to check that $HG(D^p, IC)$ is a hypertree.
In particular, observe that the set of intersections of $e_1$ with the other hyperedges of $HG(D^p, IC)$, that is
$Int(e_1, HG(D^p, IC)) = \{\{t_3\}, \{t_3, t_4\}\}$, is a matryoshka. Analogously, $Int(e_2, HG(D^p, IC))$ is a matryoshka as well.

Since $HG(D^p, IC)$ is a hypertree, thanks to Theorem 2 we can conclude that $D^p$ is consistent iff the following
inequalities hold:

$p_1 + p_3 + p_4 \le 2$; $p_2 + p_3 + p_4 \le 2$; $p_3 + p_5 \le 1$.

□ 

Note that the condition of Theorem 2 is a necessary condition for consistency in the presence of conflict hypergraphs
of any shape, not necessarily hypertrees (in fact, in the proof of the necessary condition of Theorem 2 we did
not use the assumption that the conflict hypergraph is a hypertree). The following example shows that this condition
is not sufficient in general, in particular when the conflict hypergraph contains "cycles".

Example 9. Consider the hypergraph $HG(D^p, IC)$ obtained by augmenting the hypertree in Figure 3 with the hyperedge
$e_5 = \{t_8, t_9\}$ (whose presence invalidates the acyclicity of the hypergraph). Let the probabilities of $t_1, \dots, t_9$ be as
follows:

Although the condition of Theorem 2 holds for every hyperedge $e_i$, with $i$ in [1..5], there is no model of $D^p$ w.r.t. $IC$.
In fact, the overall probability of the possible worlds containing $t_8$ must be 1/2; due to hyperedges $e_4$ and $e_5$, these
possible worlds can contain neither $t_6$ nor $t_9$, which must appear together in the remaining possible worlds (since
the marginal probability of $t_6$ and $t_9$ is equal to the sum of the probabilities of the possible worlds not containing $t_8$);
however, as $t_7$ cannot co-exist with both $t_6$ and $t_9$ (due to $e_3$), it must be in the worlds containing $t_8$; but, as the overall
probability of these worlds is 1/2, they are not sufficient to make the probability of $t_7$ equal to 3/4. □

4.1.3. "Cyclic" hypergraphs: cliques and rings 

An interesting tractable case which holds even in the presence of cycles in the conflict hypergraph is when the 
constraints define buckets of tuples: buckets are disjoint sets of tuples, such that each pair of tuples in the same bucket 
are mutually exclusive. The conflict hypergraph describing a set of buckets is simply a graph consisting of disjoint 
cliques, each one corresponding to a bucket. It is straightforward to see that, in this case, the consistency problem can 
be decided by just verifying that, for each clique, the sum of the probabilities of the tuples in it is not greater than 1 . 
Observe that the presence of buckets in the conflict hypergraph can be due to key constraints. Thus, what was said above
implies that cc is tractable in the presence of keys. However, we will return to the tractability of key constraints in
the next section, where we will generalize this tractability result to the presence of one FD per relation.
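
The bucket test described above amounts to one sum per clique. A minimal Python sketch, assuming each bucket is given directly as the list of marginal probabilities of its tuples (the name `buckets_consistent` and the sample values are hypothetical):

```python
from fractions import Fraction

def buckets_consistent(buckets):
    """True iff, in every bucket, the tuple probabilities sum to at most 1."""
    return all(sum(bucket, Fraction(0)) <= 1 for bucket in buckets)

# Hypothetical buckets, e.g. groups of tuples sharing the same key value.
good = [[Fraction(1, 2), Fraction(1, 4)], [Fraction(1, 3), Fraction(2, 3)]]
bad = [[Fraction(1, 2), Fraction(3, 4)]]
good_ok = buckets_consistent(good)
bad_ok = buckets_consistent(bad)
```

Since the tuples of a bucket are pairwise mutually exclusive, the worlds containing them are disjoint events, which is why the sum per bucket cannot exceed 1.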

We now state a more interesting tractability result holding in the presence of "cycles" in the conflict hypergraph. 

Theorem 3. Given an instance $D^p$ of $\mathcal{D}^p$, if $HG(D^p, IC) = (N, E)$ is a ring, then $D^p \models IC$ iff both the following hold:
1) $\forall e \in E$, $\sum_{t \in e} p(t) \le |e| - 1$; 2) $\sum_{t \in N} p(t) - |N| + \lceil |E|/2 \rceil \le 0$.

Interestingly, Theorem 3 states that, when deciding the consistency of tuples arranged as a ring in the conflict
hypergraph, it is not sufficient to consider the local consistency w.r.t. each hyperedge (as happens in the case of
conflict hypertrees), as also a condition involving all the tuples and hyperedges must hold. As an application of this
result, consider the case that $HG(D^p, IC)$ is the ring whose nodes are $t_1, t_2, t_3, t_4$ (where $p(t_1) = p(t_2) = p(t_3) = 1/2$
and $p(t_4) = 1$), and whose edges are: $e_1 = \{t_1, t_2, t_4\}$, $e_2 = \{t_1, t_3, t_4\}$, $e_3 = \{t_2, t_3\}$. It is easy to see that property 1)
of Theorem 3 (which is necessary for consistency, as already observed) is satisfied, while property 2) is not (in fact,
$\sum_{t \in N} p(t) - |N| + \lceil |E|/2 \rceil = 5/2 - 4 + 2 = 1/2 > 0$), which implies inconsistency. Note that changing $p(t_4)$ to 1/2 yields
consistency.
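
The two conditions of Theorem 3 can be checked mechanically on the instance just discussed. The sketch below is a minimal illustration, taking the last term of condition 2) to be $\lceil |E|/2 \rceil$, which matches the worked computation $5/2 - 4 + 2 = 1/2$; the function name `ring_consistent` is hypothetical, and the function assumes (without verifying) that the input hypergraph is a ring.

```python
from fractions import Fraction

def ring_consistent(prob, edges):
    """Check conditions 1) and 2) of Theorem 3 on a hypergraph assumed to be a ring."""
    nodes = set().union(*edges)
    local_ok = all(sum(prob[t] for t in e) <= len(e) - 1 for e in edges)
    half_edges = (len(edges) + 1) // 2            # ceil(|E| / 2)
    global_ok = sum(prob[t] for t in nodes) - len(nodes) + half_edges <= 0
    return local_ok and global_ok

half = Fraction(1, 2)
edges = [{"t1", "t2", "t4"}, {"t1", "t3", "t4"}, {"t2", "t3"}]
prob = {"t1": half, "t2": half, "t3": half, "t4": Fraction(1)}
original = ring_consistent(prob, edges)
modified = ring_consistent({**prob, "t4": half}, edges)
```

With $p(t_4) = 1$ the local condition holds but the global one fails; lowering $p(t_4)$ to 1/2 makes the global expression equal to 0, so both conditions hold.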

Remark 2. Further tractable cases due to the conflict hypergraph. The tractability results given so far can be 
straightforwardly merged into a unique more general result: cc is tractable if the conflict hypergraph consists of max- 
imal connected components such that each of them is either a hypertree, a clique, or a ring. In fact, it is easy to see 
that the consistency can be checked by considering the connected components separately. 
4.2. Tractability arising from the syntactic form of the denial constraints

We now address the determination of tractable cases from a different perspective. That is, rather than searching 
for other properties of the conflict hypergraph guaranteeing that the consistency can be checked in polynomial time, 
we will search for syntactic properties of denial constraints which can be detected without looking at the conflict 
hypergraph and which yield the tractability of cc. We start from the following result. 

Theorem 4. If $IC$ consists of a join-free denial constraint, then cc is in PTIME. In particular, $D^p \models IC$ iff, for each
hyperedge $e$ of $HG(D^p, IC)$, it holds that $\sum_{t \in e} p(t) \le |e| - 1$.

Example 10. Consider the PDB scheme consisting of the probabilistic relation scheme $Employee^p$(Name, Age, Team,
P). This scheme is used to represent some (uncertain) personal information about the employees of an enterprise. The
uncertain data were obtained starting from anonymized data, and then estimating sensitive information (such as the
names of the employees). Assume that the PDB instance $D^p$ obtained this way consists of the instance $employee^p$ of
$Employee^p$ shown in Figure 6(a).

Figure 6. (a) PDB instance $D^p$; (b) Conflict hypergraph $HG(D^p, IC)$



From some knowledge of the domain, it is known that at least one team among 'A', 'B', 'C' consists of only
young employees, i.e., employees at most 30 years old. This corresponds to considering $IC = \{ic\}$ as the set of denial
constraints, where $ic$ is as follows:

$ic$: $\neg[\, Employee(x_1, x_2, \text{'A'}) \wedge Employee(x_3, x_4, \text{'B'}) \wedge Employee(x_5, x_6, \text{'C'}) \wedge x_2 > 30 \wedge x_4 > 30 \wedge x_6 > 30 \,]$.

It is easy to see that $ic$ is a join-free denial constraint, thus the consistency of $D^p$ can be decided using Theorem 4.
In particular, since $HG(D^p, IC)$ is the hypergraph depicted in Figure 6(b), we have that $D^p$ is consistent if and only
if the following inequalities hold:

$p(t_1) + p(t_3) + p(t_6) \le 2$; $p(t_1) + p(t_3) + p(t_7) \le 2$; $p(t_1) + p(t_4) + p(t_6) \le 2$; $p(t_1) + p(t_4) + p(t_7) \le 2$.

As a matter of fact, all these inequalities are satisfied, thus the considered PDB is consistent. In fact, there is a unique
model $Pr$ for $D^p$ w.r.t. $IC$. In particular, $Pr$ assigns probability 1/2 to each of the possible worlds $w_1 = \{t_1, t_2, t_3, t_4, t_5\}$
and $w_2 = \{t_2, t_5, t_6, t_7\}$, and probability 0 to all the other possible worlds. □

The result of Theorem 4 strengthens what was already observed in the previous section: the arity of constraints is not,
per se, a source of complexity. In what follows, we show that the arity can become a source of complexity when
combined with the presence of join conditions.

Theorem 5. There is an IC consisting of a non-join-free denial constraint of arity 3 such that cc is NP-hard. 

Still, one may be interested in what happens to the complexity of cc for denial constraints containing joins and
having arity strictly lower than 3. In particular, since in the proof of Theorem 5 we exploit a ternary EGD to show
the NP-hardness of cc in the presence of ternary constraints with joins (see Appendix A.3), it is worth investigating
what happens when only binary EGDs are considered, which are denial constraints with arity 2 containing joins. The
following theorem addresses this case, and states that cc becomes tractable for any IC consisting of a binary EGD.

Theorem 6. If IC consists of a binary EGD, then cc is in PTIME. 
Differently from the previous theorems on the tractability of cc, in the statement of Theorem 6, for the sake of
presentation, we have not explicitly reported the necessary and sufficient conditions for consistency. In fact, in this
setting, deciding on the consistency requires reasoning by cases, and then checking some conditions which are not
easy to define compactly. However, these conditions can be checked in polynomial time, and the interested reader
can find their formal definition in the proof of Theorem 6 (see Appendix A.3).

Binary EGDs can be viewed as a generalization of FDs, involving pairs of tuples possibly belonging to different
relations. For instance, over the relation schemes Student(Name, Address, University) and Employee(Name, Address,
Firm), the binary EGD $\neg[\, Student(x_1, x_2, x_3) \wedge Employee(x_1, x_4, x_5) \wedge x_2 \neq x_4 \,]$ imposes that if a student and an
employee are the same person (i.e., they have the same name), then they must have the same address. Thus, an
immediate consequence of Theorem 6 is that cc is tractable in the presence of a single FD.

The results presented so far refer to cases where IC consists of a single denial constraint. We now devote our 
attention to the case that IC is not a singleton. In particular, the last tractability result makes the following question 
arise: "Is cc still tractable when IC contains several binary EGDs?". (Obviously, we do not consider the case of 
multiple EGDs of any arity, as Theorem 5 states that cc is already hard if IC merely contains one constraint of this
form.) The following theorem provides a negative answer to this question, as it states that cc can be intractable even 
in the simple case that IC consists of just two FDs (as recalled above, FDs are special cases of binary EGDs). 

Theorem 7. There is an IC consisting of 2 FDs over the same relation scheme such that cc is NP-hard. 

However, the source of complexity in the case of two FDs is that they are defined over the same relation (see the
proof of Theorem 7 in Appendix A.3). As a matter of fact, the following theorem states that all the tractability results
stated in this section in the presence of only one denial constraint can be extended to the case of multiple denial
constraints defined over disjoint sets of relations. Intuitively enough, this derives from the fact that, if the denial
constraints involve disjoint sets of relations, the overall consistency can be checked by considering the constraints
separately.

Theorem 8. Let each denial constraint in $IC$ be join-free or a BEGD. If, for each pair of distinct constraints $ic_1, ic_2$
in $IC$, the relation names occurring in $ic_1$ are distinct from those in $ic_2$, then cc is in PTIME.

Hence, the above theorem entails that cc is tractable in the interesting case that IC consists of one FD per relation. 
In the following theorem, we elaborate more on this case, and specify necessary and sufficient conditions which can 
be checked to decide the consistency. 

Theorem 9. If $IC$ consists of one FD per relation, then $HG(D^p, IC)$ is a graph where each connected component
is either a singleton or a complete multipartite graph. Moreover, $D^p$ is consistent w.r.t. $IC$ iff the following property
holds: for each connected component $C$ of $HG(D^p, IC)$, denoting the maximal independent sets of $C$ as $S_1, \dots, S_k$, it
is the case that $\sum_{i \in [1..k]} \tilde{p}_i \le 1$, where $\tilde{p}_i = \max_{t \in S_i} p(t)$.

We recall that a complete multipartite graph is a graph whose nodes can be partitioned into sets such that an edge
exists if and only if it connects two nodes belonging to distinct sets. Each of these sets is a maximal independent set
of nodes. For instance, the portion of the graph in Figure 7(b) containing only the nodes $t_1, t_2, t_3, t_4, t_5$ is a complete
multipartite graph whose maximal independent sets are $S_1 = \{t_1, t_2\}$, $S_2 = \{t_3, t_4\}$, $S_3 = \{t_5\}$. The following example
shows an application of Theorem 9.

Example 11. Consider the PDB scheme consisting of the probabilistic relation scheme $Person^p$(Name, City, State,
P), and its instance $D^p$ consisting of the instance $person^p$ of $Person^p$ shown in Figure 7(a).

Consider the FD $ic$: City $\to$ State, which can be rewritten as $\neg[Person(x_1, x_2, x_3) \wedge Person(x_4, x_2, x_5) \wedge x_3 \neq x_5]$.
The conflict hypergraph $HG(D^p, IC)$ is the graph depicted in Figure 7(b). It consists of 3 connected components:
one of them is a singleton (and corresponds to the maximal independent set $S_4$), and the other two are the complete
multipartite graphs over the maximal independent sets $S_1, S_2, S_3$ and $S_5, S_6$, respectively. Theorem 9 says that $D^p$ is
consistent if and only if the following three inequalities (one for each connected component of $HG(D^p, IC)$) hold:

$\max\{p(t_1), p(t_2)\} + \max\{p(t_3), p(t_4)\} + p(t_5) \le 1$;   $p(t_6) \le 1$;   $\max\{p(t_7), p(t_8)\} + p(t_9) \le 1$.

As a matter of fact, all these inequalities are satisfied, thus the considered PDB is consistent. In fact, there is a
model $M$ for $D^p$ w.r.t. IC assigning probability 1/4 to each of the possible worlds $w_1 = \{t_1, t_2, t_6, t_7, t_8\}$,
$w_2 = \{t_1, t_6, t_7\}$, $w_3 = \{t_3, t_4, t_6, t_9\}$, and $w_4 = \{t_5, t_8\}$, and probability 0 to all the other possible worlds.
The reader can easily check that there are models for $D^p$ w.r.t. IC other than $M$. □

Figure 7. (a) PDB instance $D^p$; (b) Conflict hypergraph $HG(D^p, IC)$
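The condition of Theorem 9 can be checked with a single pass over the data. The following is a minimal sketch, assuming a single relation stored as a list of dictionaries and one FD given by its left- and right-hand attribute lists; the function name `consistent_under_fd` and the sample rows are hypothetical and do not reproduce the instance of Figure 7.

```python
from collections import defaultdict
from fractions import Fraction


def consistent_under_fd(rows, lhs, rhs, prob="P"):
    """Theorem 9 test for one relation and one FD lhs -> rhs.

    rows: list of dicts, one per tuple; prob: name of the probability attribute.
    """
    # Tuples agreeing on lhs form one group. Inside a group, tuples agreeing
    # on rhs never conflict, so each rhs-value is a maximal independent set.
    # (A group with a single rhs-value only contains isolated nodes, for
    # which the test below holds trivially.)
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        groups[x][y].append(row[prob])
    # Consistency: in every group, the largest probabilities of the
    # independent sets must sum to at most 1.
    return all(
        sum(max(ps) for ps in by_y.values()) <= 1
        for by_y in groups.values()
    )


# Hypothetical data, for illustration only.
person = [
    {"Name": "Ann", "City": "Rome", "State": "IT", "P": Fraction(1, 2)},
    {"Name": "Bob", "City": "Rome", "State": "IT", "P": Fraction(1, 4)},
    {"Name": "Cal", "City": "Rome", "State": "FR", "P": Fraction(1, 2)},
    {"Name": "Dee", "City": "Oslo", "State": "NO", "P": Fraction(3, 4)},
]
print(consistent_under_fd(person, ["City"], ["State"]))
```

Exact rationals are used so that the boundary case (a sum equal to 1) is not affected by floating-point rounding.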

4.3. Tractability implied by conflict-hypergraph properties vs. tractability implied by syntactic forms. 

The tractability results stated in Sections 4.1 and 4.2 can be viewed as complementary to each other. In fact, an
instance of cc may turn out to be tractable due to the syntactic form of the constraints, even if the shape of the conflict
hypergraph is none of those ensuring tractability, and vice versa. For instance, in the case that IC consists of a
join-free denial constraint or a binary EGD, it is easy to see that the conflict hypergraph may not be a hypertree or a
ring, but cc is nevertheless tractable due to Theorems 2 and 3. Vice versa, if IC contains two FDs per relation or a
ternary denial constraint with joins (which, potentially, are hard cases, due to Theorems 5 and 7), cc may turn out to
be tractable, if the way the data combine with the constraints yields a conflict hypergraph which is a hypertree or a
ring (see Theorems 2 and 3).

On the whole, the tractability results presented in Sections 4.1 and 4.2 can be used conjunctively when addressing
cc: for instance, one can start by examining the constraints and check whether they conform to a tractable syntactic
form, and, if this is not the case, one can look at the conflict hypergraph and check whether its structure entails
tractability.



5. Querying PDBs under constraints 

As explained in the previous section, given a PDB $D^p$ in the presence of a set IC of integrity constraints, not all
the interpretations of $D^p$ are necessarily models w.r.t. IC. If $D^p$ is consistent w.r.t. IC, there may be exactly one
model (Case 2 of the motivating example), or more (Case 3 of the same example). In the latter case, given that all the
models satisfy all the constraints in IC, there is no reason to assume one model more reasonable than the others (at
least in the absence of other knowledge not encoded in the constraints). Hence, when querying $D^p$, it is "cautious"
to answer queries by taking into account all the possible models for $D^p$ w.r.t. IC. In this section, we follow this
argument and introduce a cautious querying paradigm for conjunctive queries, where query answers consist of tuples
associated with probability ranges: given a query Q posed over $D^p$, the range associated with a tuple t in the answer of
Q contains every probability with which t would be returned as an answer of Q if Q were evaluated separately on every
model of $D^p$. In what follows, we first introduce the formal definition of conjunctive query in the probabilistic setting,
and introduce its semantics according to the above-discussed cautious paradigm. Then, we provide our contributions
on the characterization of the problem of computing query answers.

A (conjunctive) query over a PDB schema $\mathcal{D}^p$ is written as a (conjunctive) query over its deterministic part
$det(\mathcal{D}^p)$. Thus, it is an expression of the form:
$Q(\vec{x}) = \exists \vec{z}.\ R_1(\vec{y}_1) \wedge \cdots \wedge R_m(\vec{y}_m) \wedge \phi(\vec{y}_1, \ldots, \vec{y}_m)$, where:

- $R_1, \ldots, R_m$ are names of relations in $det(\mathcal{D}^p)$;

- $\vec{x}$ and $\vec{z}$ are tuples of variables, having no variable in common;

- $\vec{y}_1, \ldots, \vec{y}_m$ are tuples of variables and constants such that every variable in any $\vec{y}_i$ occurs in either $\vec{x}$ or $\vec{z}$, and vice versa;

- $\phi(\vec{y}_1, \ldots, \vec{y}_m)$ is a conjunction of built-in predicates, each of the form $\alpha \circ \beta$, where $\alpha$ and $\beta$ are either variables in
$\vec{y}_1, \ldots, \vec{y}_m$ or constants, and $\circ \in \{=, \neq, \le, \ge, <, >\}$.

A query Q will be said to be projection-free if $\vec{z}$ is empty.

The semantics of a query Q over a PDB $D^p$ in the presence of a set of integrity constraints IC is given in two
steps. First, we define the answer of Q w.r.t. a single model M of $D^p$. Then, we define the answer of Q w.r.t. $D^p$,
which summarizes all the answers of Q obtained by separately evaluating Q over every model of $D^p$. Obviously, we
rely on the assumption that $D^p$ is consistent w.r.t. IC, thus $\mathcal{M}(D^p, IC)$ is not empty.

The answer of Q over a model M of $D^p$ w.r.t. IC is the set $Ans^M(Q, D^p, IC)$ of pairs of the form $\langle \vec{t}, p^M_Q(\vec{t}) \rangle$ such
that:

- $\vec{t}$ is a ground tuple such that $\exists w \in pwd(D^p)$ s.t. $w \models Q(\vec{t})$;

- $p^M_Q(\vec{t}) = \sum_{w \in pwd(D^p) \wedge w \models Q(\vec{t})} M(w)$ is the overall probability of the possible worlds where $Q(\vec{t})$ evaluates to true,

where $w \models Q(\vec{t})$ denotes that $Q(\vec{t})$ evaluates to true in w.

In general, there may be several models for $D^p$, and the same tuple $\vec{t}$ may have different probabilities in the answers
evaluated over different models. Thus, the overall answer of Q over $D^p$ is defined in what follows as a summarization
of all the answers of Q over all the models of $D^p$.

Definition 4 (Query answer). Let Q be a query over $\mathcal{D}^p$, and $D^p$ an instance of $\mathcal{D}^p$. The answer of Q over $D^p$ is the
set $Ans(Q, D^p, IC)$ of pairs $\langle \vec{t}, [p^{min}, p^{max}] \rangle$ where:

- $\exists M \in \mathcal{M}(D^p, IC)$ s.t. $\vec{t}$ is a tuple in $Ans^M(Q, D^p, IC)$;

- $p^{min} = \min_{M \in \mathcal{M}(D^p, IC)} \{p^M_Q(\vec{t})\}$, $p^{max} = \max_{M \in \mathcal{M}(D^p, IC)} \{p^M_Q(\vec{t})\}$.

Hence, each tuple $\vec{t}$ in $Ans(Q, D^p, IC)$ is associated with an interval $[p^{min}, p^{max}]$, whose extremes are, respectively,
the minimum and maximum probability of $\vec{t}$ in the answers of Q over the models of $D^p$. Examples of answers of a
query are reported in the motivating example. In the following, we say that $\vec{t}$ is an answer of Q with minimum and
maximum probabilities $p^{min}$ and $p^{max}$ if $\langle \vec{t}, [p^{min}, p^{max}] \rangle \in Ans(Q, D^p, IC)$.

The following proposition gives an insight on the semantics of query answers, as it better explains the meaning of
the probability range associated with each tuple occurring in the set of answers of a query. That is, it states that, taken
any pair $\langle \vec{t}, [p^{min}, p^{max}] \rangle$ in $Ans(Q, D^p, IC)$, every value p inside the interval $[p^{min}, p^{max}]$ is "meaningful", in the sense
that there is at least one model for which $\vec{t}$ is an answer of Q with probability p. Considering this property along with the
fact that the boundaries $p^{min}$, $p^{max}$ are the minimum and maximum probabilities of $\vec{t}$ as an answer of Q (which follows
from Definition 4), we have that $[p^{min}, p^{max}]$ is the tightest interval containing all the probabilities of $\vec{t}$ as an answer of
Q, and is dense (every value inside it corresponds to a probability of $\vec{t}$ as an answer of Q).

Proposition 2. Let Q be a query over $\mathcal{D}^p$, and $D^p$ an instance of $\mathcal{D}^p$. For each pair $\langle \vec{t}, [p^{min}, p^{max}] \rangle$ in $Ans(Q, D^p, IC)$,
and each probability value $p \in [p^{min}, p^{max}]$, there is a model M of $D^p$ w.r.t. IC such that $\langle \vec{t}, p \rangle \in Ans^M(Q, D^p, IC)$.

Proof. We first introduce a system $S(\mathcal{D}^p, IC, D^p)$ of linear (in)equalities whose solutions one-to-one correspond to
the models of $D^p$ w.r.t. IC. For every $w_i \in pwd(D^p)$, let $v_i$ be a variable ranging over the domain of rational numbers.
The variable $v_i$ will be used to represent the probability assigned to $w_i$ by an interpretation of $D^p$. The system of linear
(in)equalities $S(\mathcal{D}^p, IC, D^p)$ is as follows:

$\forall t \in D^p,\ \sum_{i \mid w_i \in pwd(D^p) \wedge t \in w_i} v_i = p(t)$   (e1)
$\sum_{i \mid w_i \in pwd(D^p) \wedge w_i \not\models IC} v_i = 0$   (e2)
$\sum_{i \mid w_i \in pwd(D^p)} v_i = 1$   (e3)
$\forall w_i \in pwd(D^p),\ v_i \ge 0$   (e4)

The first $|D^p|$ equalities (e1) in $S(\mathcal{D}^p, IC, D^p)$ encode the fact that, for each tuple t in the PDB instance, the sum
of the probabilities assigned to the worlds containing the tuple t must be equal to the marginal probability of t. The
subsequent two equalities (e2), (e3), along with the inequalities (e4) imposing that the probabilities $v_i$ assigned to each
possible world are non-negative, entail that the probability assigned to any world violating IC is 0, as well as that the
probabilities assigned to all the possible worlds sum up to 1.

It is easy to see that every solution s of $S(\mathcal{D}^p, IC, D^p)$ one-to-one corresponds to a model Pr for $D^p$ w.r.t. IC,
where $Pr(w_i)$ is equal to $v_i[s]$, i.e., the value of $v_i$ in s.

We now consider the system of linear (in)equalities $S^*(\mathcal{D}^p, IC, D^p)$ obtained by augmenting the set of (in)equalities
in $S(\mathcal{D}^p, IC, D^p)$ with the following equality:

$v^* = \sum_{i \mid w_i \in pwd(D^p) \wedge w_i \models Q(\vec{t})} v_i$

where $v^*$ is a new variable symbol not appearing in $S(\mathcal{D}^p, IC, D^p)$.

Obviously, every solution s of $S^*(\mathcal{D}^p, IC, D^p)$ still one-to-one corresponds to a model Pr for $D^p$ w.r.t. IC such
that, for each possible world $w_i \in pwd(D^p)$, $Pr(w_i)$ is equal to $v_i[s]$, and $v^*[s]$ (the value of $v^*$ in s) is equal to the sum
of the probabilities assigned by Pr to the possible worlds where $\vec{t}$ is an answer of Q. Therefore, $p^{min}$ (resp. $p^{max}$) is
the optimal value of the following linear programming problem $LP(S^*)$:

minimize (resp. maximize) $v^*$
subject to $S^*(\mathcal{D}^p, IC, D^p)$

Since the feasible region shared by the min- and max- variants of $LP(S^*)$ is defined by linear inequalities only,
it follows that it is a convex polyhedron. Hence, the following well-known result [42] can be exploited: "given two
linear programming problems $LP_1$ and $LP_2$ minimizing and maximizing the same objective function f over the same
convex feasible region S, respectively, it is the case that for any value v belonging to the interval $[v^{min}, v^{max}]$, whose
extreme values are the optimal solutions of $LP_1$ and $LP_2$, respectively, there is a solution s of S such that v is the
value taken by f when evaluated over s". This result entails that, for every probability value $p \in [p^{min}, p^{max}]$ taken by
the objective function $v^*$ of $LP(S^*)$, there is a feasible solution s of $S^*(\mathcal{D}^p, IC, D^p)$ such that $p = v^*[s]$. Hence, the
statement follows from the fact that every solution of $S^*(\mathcal{D}^p, IC, D^p)$ one-to-one corresponds to a model for $D^p$ w.r.t.
IC. □
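The system used in the proof can be made concrete on a tiny instance. The sketch below is only an illustration under stated assumptions: it enumerates the possible worlds explicitly (hence it is exponential in the number of tuples), represents a candidate model as a dictionary from worlds to probabilities, and checks conditions (e1)-(e4). The helper names `possible_worlds`, `is_model`, `answer_probability` and `mix`, as well as the two-tuple instance, are hypothetical. Mixing two models with a weight in [0, 1] yields another model whose value of v* lies between the two original values, which is the convexity argument behind the density of the interval.

```python
from fractions import Fraction
from itertools import combinations


def possible_worlds(tuples):
    """All subsets of the tuple set, as frozensets."""
    items = sorted(tuples)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def is_model(pr, marginals, conflicts):
    """Check (e1)-(e4) for a candidate assignment pr: world -> probability."""
    worlds = possible_worlds(marginals)
    v = {w: pr.get(w, Fraction(0)) for w in worlds}
    e1 = all(sum(v[w] for w in worlds if t in w) == p
             for t, p in marginals.items())
    e2 = sum(v[w] for w in worlds if any(c <= w for c in conflicts)) == 0
    e3 = sum(v.values()) == 1
    e4 = all(x >= 0 for x in v.values())
    return e1 and e2 and e3 and e4


def answer_probability(pr, holds):
    """v*: total probability of the worlds in which the answer holds."""
    return sum(p for w, p in pr.items() if holds(w))


def mix(pr_a, pr_b, lam):
    """Convex combination lam * pr_a + (1 - lam) * pr_b of two assignments."""
    worlds = set(pr_a) | set(pr_b)
    return {w: lam * pr_a.get(w, 0) + (1 - lam) * pr_b.get(w, 0) for w in worlds}


# Hypothetical instance: two tuples with marginal 1/2 and no constraints.
half = Fraction(1, 2)
marginals = {"a": half, "b": half}
low = {frozenset({"a"}): half, frozenset({"b"}): half}    # a and b never together
high = {frozenset({"a", "b"}): half, frozenset(): half}   # a and b always together
both = lambda w: {"a", "b"} <= w                          # "a and b are both present"

middle = mix(low, high, half)
print(is_model(middle, marginals, []), answer_probability(middle, both))
```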

The definition of query answers with associated ranges is reminiscent of the treatment of aggregate queries in
inconsistent databases ||4J]. In that framework, the consistent answer of an aggregate query Agg is a range $[v_1, v_2]$,
whose boundaries represent the minimum and maximum answer which would be obtained by evaluating Agg on at
least one repair of the database. However, the consistent answer is not, in general, a dense interval: for instance, it can
happen that there are only two repairs, one corresponding to $v_1$ and one to $v_2$, while the values between $v_1$ and $v_2$
cannot be obtained as answers on any repair.

In the rest of this section, we address the evaluation of queries from two standpoints: we first consider a decision
version of the query answering problem, and then we investigate the query evaluation as a search problem. In the
following, besides assuming that a database schema $\mathcal{D}^p$ and a set of constraints IC of fixed size are given, we also
assume that queries over $\mathcal{D}^p$ are of fixed size. Thus, all the complexity results refer to data complexity.

5.1. Querying as a decision problem 

In the classical "deterministic" relational setting, the decision version of the query answering problem is com- 
monly defined as the membership problem of deciding whether a given tuple belongs to the answer of a given query. 
In our scenario, tuples belong to query answers with some probability range, thus it is natural to extend this definition 
to our probabilistic setting in the following way. 

Definition 5 (Membership Problem (mp)). Given a query Q over $\mathcal{D}^p$, an instance $D^p$ of $\mathcal{D}^p$, a ground tuple $\vec{t}$, and the
constants $k_1$ and $k_2$ (with $0 \le k_1 \le k_2 \le 1$), the membership problem is deciding whether $\vec{t}$ is an answer of Q with minimum
and maximum probabilities $p^{min}$ and $p^{max}$ such that $p^{min} \ge k_1$ and $p^{max} \le k_2$.

Hence, solving mp can be used to decide whether a given tuple is an answer with a probability which is at least $k_1$
and not greater than $k_2$. Observe that Definition 5 collapses to the classical definition of membership problem when
data are deterministic: in fact, asking whether a tuple belongs to the answer of a query posed over a deterministic
database corresponds to solving mp over the same database with $k_1 = k_2 = 1$.

From the results in [35], where an entailment problem more general than mp was shown to be in coNP (see Section
7), it can be easily derived that mp is in coNP as well. The next theorems (which are preceded by a preliminary lemma)
determine two cases when this upper bound on the complexity is tight.

Lemma 2. Let Q be a conjunctive query over $\mathcal{D}^p$, $D^p$ an instance of $\mathcal{D}^p$, and $\vec{t}$ an answer of Q having minimum
probability $p^{min}$ and maximum probability $p^{max}$. Let m be the number of tuples in $D^p$ plus 3 and a be the maximum
among the numerators and denominators of the probabilities of the tuples in $D^p$. Then $p^{min}$ and $p^{max}$ are expressible
as fractions of the form $\eta/\delta$, with $0 \le \eta \le (ma)^m$ and $0 < \delta \le (ma)^m$.

Theorem 10 (Lower bound of mp). There is at least one conjunctive query containing projection for which mp is 
coNP-hard, even if IC is empty. 

Proof. We show a LOGSPACE reduction from the consistency checking problem (cc) in the presence of binary denial
constraints, which is NP-hard (see Theorem 7), to the complement of the membership problem ($\overline{mp}$).

Let $\langle \mathcal{D}^p_{cc}, IC_{cc}, D^p_{cc} \rangle$ be an instance of cc. We construct an equivalent instance $\langle \mathcal{D}^p_{mp}, IC_{mp}, D^p_{mp}, Q, t_\emptyset, k_1, k_2 \rangle$ of $\overline{mp}$
as follows.

- $\mathcal{D}^p_{mp}$ consists of relation schemas $R^p(tid, P)$ and $S^p(tid_1, tid_2, P)$;

- $IC_{mp} = \emptyset$, that is, no constraint is assumed on $\mathcal{D}^p_{mp}$;

- $D^p_{mp}$ is the instance of $\mathcal{D}^p_{mp}$ which contains, for each tuple $t \in D^p_{cc}$, the tuple $R^p(id(t), p(t))$, where $id(t)$ is a unique
identifier associated to the tuple t. Moreover, $D^p_{mp}$ contains, for each pair of tuples $t_1, t_2$ in $D^p_{cc}$ which are conflicting
w.r.t. $IC_{cc}$, the tuple $S^p(id(t_1), id(t_2), 1)$;

- $Q = \exists x, y\ R(x) \wedge R(y) \wedge S(x, y)$;

- $t_\emptyset$ is the empty tuple;

- the lower bound $k_1$ of the minimum probability of $t_\emptyset$ as answer of Q is set equal to $k_1 = \frac{1}{(ma)^m}$, where m is the
number of tuples in $D^p_{mp}$ plus 3, and a the maximum among the numerators and denominators of the probabilities of
the tuples in $D^p_{mp}$;

- the upper bound $k_2$ of the maximum probability of $t_\emptyset$ as answer of Q is set equal to 1.

Obviously, the $\overline{mp}$ instance returns true iff the minimum probability that $t_\emptyset$ is an answer to Q over $D^p_{mp}$ is (strictly)
less than $k_1$.

It is easy to see that every interpretation of $D^p_{cc}$ (the database in the cc instance) corresponds to a unique
interpretation of $D^p_{mp}$ (the database in the $\overline{mp}$ instance), and vice versa. Observe that $D^p_{mp}$ is consistent, since the set
of constraints considered in the $\overline{mp}$ instance is empty.

We show now that the above-considered cc and $\overline{mp}$ instances are equivalent, that is, the cc instance is true iff the $\overline{mp}$
instance is true. On the one hand, if the cc instance is true, then there is at least one model $Pr_{cc}$ for $D^p_{cc}$ w.r.t. $IC_{cc}$
(that is, $Pr_{cc}$ assigns probability 0 to every possible world w which contains tuples which are conflicting according to
$IC_{cc}$). It is easy to see that evaluating Q on the corresponding interpretation $Pr_{mp}$ of $D^p_{mp}$ yields probability 0 for the
empty tuple $t_\emptyset$. Hence, the $\overline{mp}$ instance is true in this case.

On the other hand, if the $\overline{mp}$ instance is true, then the minimum probability that $t_\emptyset$ is an answer of Q must be
less than $\frac{1}{(ma)^m}$. Since $\frac{1}{(ma)^m}$ is the smallest non-zero value that can be assumed by the minimum probability of $t_\emptyset$ (see
Lemma 2), this implies that the minimum probability that $t_\emptyset$ is an answer of Q is 0. This means that there is a model
$Pr_{mp}$ that assigns probability 0 to every possible world w which contains three tuples $R(x_1)$, $R(y_1)$ and $S(x_2, y_2)$ with
$x_1 = x_2$ and $y_1 = y_2$. It is easy to see that the corresponding interpretation $Pr_{cc}$ is a model for $D^p_{cc}$ w.r.t. $IC_{cc}$, as it
assigns probability 0 to every possible world which contains conflicting tuples. Hence the cc instance is true in this
case. □
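The construction used in the reduction is mechanical, and the following minimal sketch shows its shape. It assumes that the cc instance is given as a dictionary from tuple identifiers to probabilities together with the list of conflicting pairs (computed beforehand from the binary constraints), and that the threshold is 1/(ma)^m, which is our reading of the bound suggested by Lemma 2. The function name `cc_to_mp` and the output layout are hypothetical.

```python
from fractions import Fraction


def cc_to_mp(cc_tuples, conflicting_pairs):
    """Build the data of the membership instance from a cc instance.

    cc_tuples: dict id -> Fraction (marginal probability of each tuple)
    conflicting_pairs: iterable of (id1, id2) pairs violating a binary constraint
    """
    # One R-tuple per original tuple, carrying its probability.
    r = [(tid, p) for tid, p in cc_tuples.items()]
    # One certain S-tuple (probability 1) per conflicting pair.
    s = [(t1, t2, Fraction(1)) for t1, t2 in conflicting_pairs]
    probs = [p for _, p in r] + [p for _, _, p in s]
    m = len(r) + len(s) + 3  # number of tuples of the new instance plus 3
    a = max([1] + [max(p.numerator, p.denominator) for p in probs])
    return {
        "R": r,
        "S": s,
        "IC": [],                                        # no constraints
        "Q": "exists x, y . R(x) and R(y) and S(x, y)",  # Boolean query
        "k1": Fraction(1, (m * a) ** m),                 # assumed threshold
        "k2": Fraction(1),
    }


# Hypothetical cc instance: two tuples that conflict with each other.
instance = cc_to_mp({"t1": Fraction(1, 2), "t2": Fraction(2, 3)}, [("t1", "t2")])
print(instance["k1"])
```

The query is kept as a string because only the data part of the instance depends on the input; the schema, the query and the constraints are fixed, in line with the data-complexity setting.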

The above theorem establishes that the type of the query, and in particular the fact that it contains projection, is
an important source of complexity making mp hard, irrespective of the constraints considered. For projection-free
queries, the next theorem states that mp remains hard even if only binary constraints are considered.

Theorem 11 (Lower bound of mp). There is at least one projection-free conjunctive query and a set IC consisting of
only binary constraints for which mp is coNP-hard.

We recall that, when addressing mp, we assume that the database is consistent w.r.t. the constraints. Thus, the
hardness results for mp do not derive from any source of complexity inherited by mp from cc. On the whole, Theorems
10 and 11 suggest that mp has at least two sources of complexity: the type of query (the fact that the query contains
projection or not), and the form of the constraints.

Once some sources of complexity of mp have been identified, it is worth addressing the problem of determining
tractable cases. We defer this issue until after the characterization of the query evaluation as a search problem, since, as
will be clearer in what follows, the conditions yielding tractability of the latter problem also ensure the tractability of
mp.



5.2. Querying as a search problem 

Viewed as a search problem, the query answering problem (qa) is the problem of computing the set $Ans(Q, D^p, IC)$.
The complexity of this problem is characterized as follows.

Theorem 12. qa is in $FP^{NP}$ and is $FP^{NP[\log n]}$-hard.

The fact that qa is in $FP^{NP}$ means that our "cautious" query evaluation paradigm is not more complex than the query
evaluation based on the independence assumption, which has been shown in 111 111 to be complete for #P (which strictly
contains $FP^{NP}$, assuming $P \neq NP$). On the other hand, the hardness for $FP^{NP[\log n]}$ is interesting also because it tightens
the characterization given in [35] of the more general entailment problem for probabilistic logic programs containing a
general form of probabilistic rules (conditional rules). Specifically, in [35], the above-mentioned entailment problem
was shown to be in $FP^{NP}$, but no lower bound on its data complexity was stated. Thus, our result enriches the
characterization in [35], as it implies that $FP^{NP[\log n]}$ is a lower bound for the entailment problem for probabilistic
logic programs under data complexity even in the presence of rules much simpler than conditional rules. More details
are given in Section 7, where we provide a more thorough comparison with [35]. However, finding the tightest
characterization for qa remains an open problem, as it might be the case that qa is complete for either $FP^{NP[\log n]}$
or $FP^{NP}$. We conjecture that none of these cases holds (thus a characterization of qa tighter than ours cannot be
provided), thus qa is likely to be in the "limbo" containing the problems in $FP^{NP}$ but not in $FP^{NP[\log n]}$, without being
hard for the former (this limbo is non-empty if $P \neq NP$ \3(M).

5.3. Tractability results 

In this section, we show some sufficient conditions for the tractability of the query evaluation problem, which hold 
for both its decision and search versions. When stating our results, we refer to qa only, as its tractability implies that 
of mp (as mp is straightforwardly reducible to qa). 

Again, we address the tractability from two standpoints: we will show sufficient conditions which regard either
a) the shape of the conflict hypergraph, or b) the syntactic form of the constraints. Specifically, we focus on finding
islands of tractability when queries are projection-free and either the conflict hypergraph collapses to a graph - as for
direction a), or the constraints are binary - as for direction b). These are interesting contexts, since Theorem 11 entails
that mp (and, thus, also qa) is, in general, hard in these cases (indeed, Theorem 11 implicitly shows the hardness for
the case of conflict hypergraphs collapsing to graphs, as, in the presence of binary constraints, the conflict hypergraph
is a graph).

The next result goes into direction a), as it states that, for projection-free queries, qa is tractable if the conflict 
hypergraph is a graph satisfying some structural properties. 

Theorem 13. For projection-free conjunctive queries, qa is in PTIME if $HG(D^p, IC)$ is a graph where each maximal
connected component is either a tree or a clique.

The polynomiality result stated above is rather straightforward in the case that each connected component is a
clique, but is far from being straightforward in the presence of connected components which are trees. Basically, when
the conflict hypergraph is a tree, the tractability derives from the fact that, for any conjunction of tuples, its minimum
(or, equivalently, maximum) probability can be evaluated as the solution of an instance of a linear programming
problem. In particular, differently from the "general" system of inequalities used in the proof of Proposition 2 (where
the variables correspond to the possible worlds, thus their number is exponential in the number of tuples), here we
can define a system of inequalities where both the number of inequalities and variables depend only on the arity of the
query (which is constant, as we address data complexity). We do not provide an example of the form of this system
of inequalities, as explaining the correctness of the approach on a specific case is not easier than proving its validity
in the general case. Thus, the interested reader is referred to the proof of Theorem 13 reported in Appendix A.6 for
more details.

The following result goes into direction of locating tractability scenarios arising from the syntactic form of the 
constraints, as it states that, if IC consists of one FD for each relation scheme, the evaluation of projection-free queries 
is tractable. 

Theorem 14. For projection-free conjunctive queries, qa is in PTIME if IC consists of at most one FD per relation 
scheme. 

Proof. We consider the case that IC contains one relation and one FD only, as the general case (more relations, and
one FD per relation) follows straightforwardly. Let the denial constraint ic in IC be the following FD over relation
scheme R: $X \to Y$, where X, Y are disjoint sets of attributes of R. We denote as r the instance of R in the instance
of qa. Constraint ic implies a partition of r into disjoint relations, each corresponding to a different combination of
the values of the attributes in X in the tuples of r. Taken one of these combinations $\vec{x}$ (i.e., $\vec{x} \in \Pi_X(r)$), we denote the
corresponding set of tuples in this partition as $r(\vec{x})$. That is, $r(\vec{x}) = \{t \in r \mid \Pi_X(t) = \vec{x}\}$. In turn, for each $r(\vec{x})$, ic partitions
it into disjoint relations, each corresponding to a different combination of the values of the attributes in Y. Taken one
of these combinations $\vec{y}$ (i.e., $\vec{y} \in \Pi_Y(r(\vec{x}))$), we denote the corresponding set of tuples in this partition as $r(\vec{x}, \vec{y})$.

Given this, constraint ic entails that the conflict hypergraph is a graph with the following structure: there is an
edge $\{t_1, t_2\}$ iff $\exists \vec{x}, \vec{y}_1, \vec{y}_2$, with $\vec{y}_1 \neq \vec{y}_2$, such that $t_1 \in r(\vec{x}, \vec{y}_1)$ and $t_2 \in r(\vec{x}, \vec{y}_2)$.

Now, consider any conjunction of tuples $T = t_1, \ldots, t_n$. The probability of T as an answer of the query q specified
in the instance of qa can be computed as follows. First, we partition $\{t_1, \ldots, t_n\}$ according to the maximal connected
components of the conflict hypergraph. This way we obtain the disjoint subsets $T_1, \ldots, T_k$ of $\{t_1, \ldots, t_n\}$, where
each $T_j$ corresponds to a maximal connected component of the conflict hypergraph, and contains all the tuples of
$\{t_1, \ldots, t_n\}$ which are in this component. The minimum and maximum probabilities of T as answer of q can be
obtained by computing the minimum and maximum probability of each set $T_j$, and then combining them using the
well-known Fréchet-Hoeffding formulas (reported also in the appendix as Fact 2), which give the minimum and
maximum probabilities of a conjunction of events among which no correlation is known (in fact, since $T_1, \ldots, T_k$
correspond to distinct connected components, they can be viewed as pairwise uncorrelated events).
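The combination step can be sketched as follows. The code uses the standard Fréchet-Hoeffding bounds for a conjunction of k events with unknown correlation, namely max{0, p_1 + ... + p_k - (k - 1)} and min{p_1, ..., p_k}; applying the lower formula to the minimum probabilities and the upper formula to the maximum probabilities of the sets $T_j$ is an assumption of this sketch, and the function name `frechet_conjunction` is hypothetical.

```python
from fractions import Fraction


def frechet_conjunction(ranges):
    """Combine per-component ranges [(p_min, p_max), ...] into one range.

    The components are treated as events among which no correlation is known.
    """
    k = len(ranges)
    # Lower bound: the events overlap as little as the marginals allow.
    lowest = max(Fraction(0), sum(p_min for p_min, _ in ranges) - (k - 1))
    # Upper bound: the events overlap as much as possible.
    highest = min(p_max for _, p_max in ranges)
    return lowest, highest


# Two components with exact probabilities 3/4 and 1/2.
print(frechet_conjunction([(Fraction(3, 4), Fraction(3, 4)),
                           (Fraction(1, 2), Fraction(1, 2))]))
```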

Then, it remains to show how the minimum and maximum probabilities of a single $T_j$ can be computed. We
consider the case that $T_j$ contains at least two tuples (otherwise, the minimum and the maximum probabilities of $T_j$
coincide with the marginal probability of the unique tuple in $T_j$). If $\exists t_\alpha, t_\beta \in T_j$, $\exists \vec{x}, \vec{y}_1, \vec{y}_2$ such that $t_\alpha \neq t_\beta$ and $\vec{y}_1 \neq \vec{y}_2$
and $t_\alpha \in r(\vec{x}, \vec{y}_1)$, while $t_\beta \in r(\vec{x}, \vec{y}_2)$, then the minimum and maximum probabilities of $T_j$ are both 0 (since $\{t_\alpha, t_\beta\}$ is
a conflicting set). Otherwise, it is the case that all the tuples in $T_j$ share the same values $\vec{x}$ for the attributes X, and the
same values $\vec{y}$ for the attributes Y. Due to the structure of the conflict hypergraph, it is easy to see that this implies that
the tuples in $T_j$ can be distributed in any way in the portion of the probability space which is not invested to represent
the tuples having the same values $\vec{x}$ for X, but combinations for Y other than $\vec{y}$. The size of this probability space is
$s = 1 - \sum_{\vec{y}' \neq \vec{y}} \max\{p(t) \mid t \in r(\vec{x}, \vec{y}')\}$. Hence, the minimum and maximum probabilities of $T_j$ are:

p min = max {0, Z,eT, P(t) ~ Wi\ + S\, p max = min {p(t) 1 1 e Tj}. 

The first formula is an easy generalization of the corresponding formula for the minimum probability given in 
Lemma Q] to the case of a probability space of a generic size less than 1 . The second formula derives from the 
above-recalled Frechet-Hoeffding formulas, and from the fact that the database is consistent (we recall that we rely on 
this assumption when addressing the query evaluation problem). □ 

Again, observe that the last two results are somehow complementary: it is easy to see that there are FDs yielding
conflict hypergraphs not satisfying the sufficient condition of Theorem 13, as well as conflict hypergraphs which are
trees generated by some "more general" denial constraint, not expressible as a set of FDs over distinct relations.

6. Extensions of our framework

Some extensions of our framework are discussed in what follows. In particular, for each extension, we show its 
impact on our characterization of the fundamental problems addressed in the paper. 



6.1. Tuples with uncertain probabilities 

All the results stated in this paper can be trivially extended to the case that tuples are associated with ranges of 
probabilities, rather than single probabilities (as happens in several probabilistic data models, such as OUI35IO . 

Obviously, all the hardness results for cc, mp, qa hold also for this variant, since considering tuples with single
probabilities is a special case of allowing tuples associated with ranges of probabilities.

As regards cc, both the membership in NP and the extendability of the tractable cases straightforwardly derive 
from the fact that, as only denial constraints are considered, deciding on the consistency of an assignment of ranges 
of probabilities can be accomplished by looking only at the minimum probabilities of each range. 

As regards mp and qa, the fact that the complexity upper bounds do not change follows from the results in [35].
Finally, it can be shown, with minor changes to the proof of Theorem 13, that mp and qa are still tractable under
the hypotheses on the shape of the conflict hypergraph stated in this theorem. We refer the interested reader to
Appendix A.7, where a hint is given on how the proof of Theorem 13 can be extended to deal with tuples with
uncertain probabilities. The extension of the tractability results for mp and qa regarding the syntactic forms of the
constraints is even simpler, and can be easily understood after reading the proofs of these results.



6.2. Associating constraints with probabilities. 

Another interesting extension consists in allowing constraints to be assigned probabilities. In our vision, con- 
straints should encode some certain knowledge on the data domain, thus they should be interpreted as deterministic. 
However, this extension can be interesting at least from a theoretical point of view, or when constraints are derived 
from some elaboration on historical data 111 811 . Thus, the point becomes that of giving a semantics to the probability 
assigned to the constraints. The semantics which seems to be the most intuitive is as follows: "A constraint with 
probability p forbidding the co-existence of some tuples is satisfied if there is an interpretation where the overall 
probability of the possible worlds satisfying the constraint is at least p". This means that the condition imposed by 
the constraint must hold in a portion of size p of the probability space, while nothing is imposed on the remaining 
portion of the probability space. 

Starting from this, we first discuss the impact of associating constraints with probabilities on our results about cc. 
First of all, it is easy to see that there is a reduction from any instance Prob-cc of the variant of cc with probabilistic 
constraints to an equivalent instance Std-cc of the standard version of cc. Basically, this reduction constructs the 
conflict hypergraph H(Std-cc) of Std-cc as follows: denoting the conflict hypergraph of Prob-cc as H(Prob-cc), each
hyperedge $e \in$ H(Prob-cc) (with probability p(e)) is transformed into a hyperedge e' of H(Std-cc) which consists of
the same nodes in e plus a new node with probability p(e). On the one hand, the existence of this reduction suffices to
state that also the probabilistic version of cc is NP-complete. On the other hand, it is worth noting that applying this
reduction yields a conflict hypergraph H(Std-cc) with the same "shape" as H(Prob-cc), except that each hyperedge 
has one new node, belonging to no other hyperedge: hence, if H(Prob-cc) is a hypertree (resp., a ring), then H(Std-cc) 
is a hypertree (resp., a ring) too. This means that all the tractability results given for cc concerning the shapes of the 
conflict hypergraph hold also when stated directly on its probabilistic version. However, this does not suffice to extend 
the tractability results for cc regarding the syntactic forms of the constraints, as in the considered cases the conflict 
hypergraph may not be a hypertree or a ring. Thus, the extension of the tractability results on the syntactic forms is 
deferred to future work. 
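The reduction from Prob-cc to Std-cc described above amounts to adding one fresh node per hyperedge. A minimal sketch, assuming that tuples are identified by hashable names, that marginals are kept in a dictionary, and that each probabilistic hyperedge is given as a pair (set of tuples, probability); the function name `prob_cc_to_std_cc` and the naming of the fresh nodes are hypothetical.

```python
from fractions import Fraction


def prob_cc_to_std_cc(marginals, weighted_edges):
    """Turn probabilistic hyperedges into standard ones.

    marginals: dict tuple_name -> probability
    weighted_edges: list of (set_of_tuple_names, probability_of_constraint)
    Returns the extended marginals and the list of standard hyperedges.
    """
    new_marginals = dict(marginals)
    edges = []
    for i, (edge, p) in enumerate(weighted_edges):
        guard = ("constraint", i)   # fresh node, belonging to no other hyperedge
        new_marginals[guard] = p    # it carries the probability of the constraint
        edges.append(frozenset(edge) | {guard})
    return new_marginals, edges


# Hypothetical instance: one constraint with probability 3/4 over two tuples.
marginals = {"t1": Fraction(1, 2), "t2": Fraction(1, 2)}
std_marginals, std_edges = prob_cc_to_std_cc(marginals, [({"t1", "t2"}, Fraction(3, 4))])
print(std_edges)
```

Since every fresh node occurs in exactly one hyperedge, the transformation does not create new intersections between hyperedges, which matches the observation that the "shape" of the hypergraph is preserved.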

As regards mp and qa, the arguments used in the discussion of the previous extension can be used to show that 
our lower and upper bounds still hold for the variants of these problems allowing probabilistic constraints. As for the 
tractability results, Appendix A.7 provides a more detailed discussion explaining how the proof of Theorem 13 
(which deals with conflict hypergraphs where each maximal connected component is either a clique or a tree) can be 
extended to deal with probabilistic constraints. The extension of the tractability result for FDs stated in Theorem 14 
is deferred to future work. 

6.3. Assuming pairs of tuples as independent unless this contradicts the constraints 

As observed in the introduction, in some cases, rejecting the assumption of independence for some groups of 
tuples may be somehow "overcautious". For instance, if we consider further tuples pertaining to a different hotel in 
the introductory example (where constraints involve tuples over the same hotel), it may be reasonable to assume that 
these tuples encode events independent from those pertaining to hotel 1. 

A naive way of extending our framework in this direction is that of assuming every pair of tuples which are not 
explicitly "correlated" by some constraint as independent from one another. This means considering as independent 
any two tuples $t_1$, $t_2$ such that there is no hyperedge in the conflict hypergraph containing both of them. However, 
this strategy can lead to wrong interpretations of the data. For instance, consider the case of Example 3, where each 
of the three tuples $t_1$, $t_2$, $t_3$ has probability 1/2, and two (ground) constraints are defined over them: one forbidding 
the co-existence of $t_1$ with $t_2$, and the other forbidding the co-existence of $t_2$ with $t_3$. As observed in Example 3, the 
combination of these two constraints implicitly enforces the co-existence of $t_1$ with $t_3$. Hence, the fact that $t_1$ and $t_3$ 
are not involved in the same (ground) constraint does not imply that these two tuples can be considered as independent 
from one another. 
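
The implicit correlation in this example can be made explicit with a short derivation. It is a sketch that uses only the marginals and the two constraints stated above; $\Pr$ denotes any interpretation consistent with them.

```latex
\begin{align*}
\Pr(t_1 \wedge t_2) = 0 \;&\Rightarrow\; \Pr(t_1 \wedge \neg t_2) = p(t_1) = \tfrac{1}{2} = \Pr(\neg t_2),\\
\Pr(t_3 \wedge t_2) = 0 \;&\Rightarrow\; \Pr(t_3 \wedge \neg t_2) = p(t_3) = \tfrac{1}{2} = \Pr(\neg t_2),\\
&\Rightarrow\; \Pr(t_1 \wedge t_3) = \Pr(\neg t_2) = \tfrac{1}{2} \;\neq\; \tfrac{1}{4} = p(t_1)\, p(t_3).
\end{align*}
```

In words, both $t_1$ and $t_3$ must hold in (almost) every world where $t_2$ is absent, so their joint probability is 1/2 rather than the value 1/4 that independence would give.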

However, it is easy to see that if two tuples are not connected through any path in the conflict hypergraph, assuming 
independence among them does not contradict the constraints in any way. Hence, a cautious way of incorporating the 
independence assumption in our framework is the following: any two tuples are independent from one another iff they 
belong to distinct maximal connected components of the conflict hypergraph. 
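
The cautious independence criterion above amounts to a connected-components computation on the conflict hypergraph. The following is a minimal sketch using a union-find structure; the function names and the representation of hyperedges as Python sets are assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable, Set


def component_ids(
    tuples: Iterable[Hashable], hyperedges: Iterable[Set[Hashable]]
) -> Dict[Hashable, Hashable]:
    """Map each tuple to a representative of its maximal connected component."""
    parent: Dict[Hashable, Hashable] = {t: t for t in tuples}

    def find(x: Hashable) -> Hashable:
        # Walk up to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for edge in hyperedges:
        members = list(edge)
        for m in members:
            parent.setdefault(m, m)  # tolerate tuples that only appear in edges
        for other in members[1:]:
            root_a, root_b = find(members[0]), find(other)
            if root_a != root_b:
                parent[root_b] = root_a
    return {t: find(t) for t in parent}


def assumed_independent(
    t1: Hashable, t2: Hashable, comp: Dict[Hashable, Hashable]
) -> bool:
    """Two tuples are treated as independent iff their components differ."""
    return comp[t1] != comp[t2]
```

On the three-tuple example discussed earlier, extended with an unconstrained fourth tuple, the first and third tuples end up in the same component (so no independence is assumed between them), while the fourth tuple is independent of all the others.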

If this model is adopted, nothing changes in our characterization of the consistency checking problem. In fact, it 
is easy to see that an instance of cc is equivalent to an instance of the variant of cc where independence is assumed 
among maximal connected components of the conflict hypergraph. This trivially follows from the fact that, if a PDB 
$D^p$ is consistent according to the original framework, all the possible interpretations combining the models of the 
maximal connected components are themselves models of $D^p$, and the set of these interpretations also contains the 
interpretation corresponding to assuming independence among the maximal connected components. 

As regards the query evaluation problem, adopting this variant of the framework makes qa #P-hard (as qa becomes 
more general than the problem of evaluating queries under the independence assumption [11]). However, all our 
tractability results for projection-free queries still hold. In fact, the probability of $t_1, \dots, t_n$ as an answer of a query 
can be obtained as follows. First, the set $T = \{t_1, \dots, t_n\}$ is partitioned into the (non-empty) sets $S_1, \dots, S_k$, which 
correspond to distinct maximal connected components of the conflict hypergraph, and where each $S_i$ consists of 
all the tuples in $T$ belonging to the connected component corresponding to $S_i$. Then, the minimum and maximum 
probabilities of each $S_i$ are computed (in PTIME, when our sufficient conditions for tractability hold), by considering 
each $S_i$ separately. Finally, the independence assumption among the tuples belonging to distinct maximal components 
is exploited, so that the minimum (resp., maximum) probability of $t_1, \dots, t_n$ is evaluated as the product of the so 
obtained minimum (resp., maximum) probabilities of $S_1, \dots, S_k$. 
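
The three steps above (partition, per-component bounds, product) can be sketched as follows. The sketch deliberately leaves the per-component computation abstract: `bounds_in_component` is a hypothetical callback standing for whichever tractable procedure applies, and `component_of` is assumed to be a precomputed map from tuples to component identifiers.

```python
from math import prod
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Bounds = Tuple[float, float]  # (minimum probability, maximum probability)


def answer_bounds(
    answer: Sequence[Hashable],
    component_of: Dict[Hashable, Hashable],
    bounds_in_component: Callable[[List[Hashable]], Bounds],
) -> Bounds:
    """Combine per-component bounds under independence across components."""
    # Step 1: partition the answer tuples by maximal connected component.
    groups: Dict[Hashable, List[Hashable]] = {}
    for t in answer:
        groups.setdefault(component_of[t], []).append(t)
    # Step 2: bounds of each group, computed separately (assumed callback).
    per_group = [bounds_in_component(g) for g in groups.values()]
    # Step 3: independence across components turns conjunction into product.
    return (prod(lo for lo, _ in per_group), prod(hi for _, hi in per_group))
```

The product in the last step is justified only because tuples in distinct components are assumed independent; within a component no such assumption is made, which is why a range rather than a single value is propagated.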



7. Related work 

We separately discuss the related work in the AI and DB literature. 

AI setting. The works in the AI literature related to ours are mainly those dealing with probabilistic logic. The problem 
of integrating probabilities into logic was first addressed (though rather informally) in [39]. Then, in [22], the PSAT 
problem was formalized as the satisfiability problem in a propositional fragment of the logic discussed in [39], and 
shown to be NP-complete. In [17], a more general probabilistic propositional logic than that in [22] was defined, which 
enables algebraic relations to be specified among the probabilities of propositional formulas (such as "the probability 
of $\varphi_1 \wedge \varphi_2$ is twice that of $\varphi_3 \vee \varphi_4$"). [17] mainly focuses on the satisfiability problem, showing that it is NP-complete 
(thus generalizing the result on PSAT of [22]). However, it provides no tractability result (whose investigation is our 
main contribution in the study of the corresponding consistency problem). To the best of our knowledge, most of the works 
devising techniques for efficiently solving the satisfiability problem (such as [27, 34]) rely on translating it into a 
Linear Programming instance and using some heuristics, which do not guarantee polynomial-bounded complexity. 
Thus, the only works determining provable polynomial cases of probabilistic satisfiability are [22, 2]. As for [22], 
we refer the reader to the discussions in Section 4 (right after Definition 3) and at the beginning of Section 4.1. As 
regards [2], it is related to our work in that it showed that PSAT is tractable if the hypergraph of the formula (which 
corresponds to our conflict hypergraph) is a hypertree. However, the notion of hypertree in [2] is very restrictive, as it 
relies on a notion of acyclicity much less general than the γ-acyclicity used here. In fact, even the simple hypergraph 
consisting of $e_1 = \{t_1, t_2, t_3\}$, $e_2 = \{t_2, t_3, t_4\}$ is not viewed in [2] as a hypertree, since it contains at least one cycle, such 
as $t_1, e_1, t_2, e_2, t_3, e_1, t_1$ (note that, in our framework, this would not be a cycle). Basically, hypertrees in [2] are special 
cases of our hypertrees, as they require distinct hyperedges to have at most one node in common. Hence, our result 
strongly generalizes the forms of conflict hypergraphs over which cc turns out to be tractable according to the result 
of [2] on PSAT. 

The entailment problem (which corresponds to our query answering problem) was studied both in the propositional 
[34] and in the (probabilistic-)logic-programming setting [35, 38, 37]. The relationship between these works 
and ours is in the fact that they deal with knowledge bases where rules and facts can be associated with probabilities. 
Intuitively, imposing constraints over a PDB might be simulated by a probabilistic logic program, where tuples are 
encoded by (probabilistic) facts and constraints by (probabilistic) rules with probability 1. However, not all the above-cited 
probabilistic-logic-programming frameworks can be used to simulate our framework: for instance, [38, 37] use 
rules which cannot express our constraints. On the contrary, the framework in [35] enables pretty general rules to 
be specified, that is, conditional rules of the form $(H|B)[p_1, p_2]$, where $H$ and $B$ are classical open formulas, stating 
that the probability of the formula $H \wedge B$ is between $p_1$ and $p_2$ times the probability of $B$. Obviously, any denial 
constraint ic can be written as a conditional rule of the form $(H|\mathit{true})[1, 1]$, where $H$ is the open formula in ic. In 
the presence of conditional rules, [35] characterizes the complexity of the satisfiability and the entailment problems. 
The novelty of our contribution w.r.t. that of [35] derives from the specific database-oriented setting considered in our 
work. In particular, as regards the consistency problem, our tractable cases are definitely a new contribution, as [35] 
does not determine polynomially-solvable instances. As regards the query answering problem, our contribution is 
relevant from several standpoints. First, we provide a lower bound of the membership problem by assuming that the 
database is consistent: this is a strong difference with [35], where the decisional version of the entailment problem has 
been addressed without assuming the satisfiability of the knowledge base, thus the satisfiability checking is used as a 
source of complexity when deciding the entailment. Second, we have characterized the lower bound of the membership 
problem w.r.t. two specific aspects, which make sense in a database perspective and were not considered in [35]: 
the presence of projection in the query (Theorem 10) and the type of denial constraints (Theorem 11). Third, [35] 
did not prove any lower bound for the data complexity of the search version of the entailment problem. Indeed, it 
provided an $FP^{NP}$-hardness result only under combined complexity (assuming all the knowledge base as part of the 
input, while we consider constraints of fixed size) and exploiting the strong expressiveness of conditional rules, which 
enable also constraints not expressible by denial constraints to be specified. Hence, in brief, our Theorem 12 shows 
that constraints simpler than conditional constraints suffice to get an $FP^{NP[\log n]}$-hardness of the entailment for probabilistic 
logic programs, even under data complexity. Finally, our tractable cases of the query evaluation problem, to the best of 
our knowledge, are not subsumed by any result in the literature, and depict islands of tractability also for the more 
general entailment problem studied in [35]. 

DB setting. The database research literature contains several works addressing various aspects related to probabilistic 
data, and a number of models have been proposed for their representation and querying. In this section, we first 
summarize the most important results on probabilistic databases relying on the independence assumption (which, ob- 
viously, is somehow in contrast with allowing integrity constraints to be specified over the data, thus making these 
works marginally related to ours). Then, we focus our attention on other works, which are more related to ours as they 
allow some forms of correlations among data to be taken into account when representing and querying data. 

As regards the works relying on the independence assumption, the problem of efficiently evaluating (conjunctive) 
queries was first studied in [11], where it was shown that this problem is #P-hard in the general case of queries 
without self-joins, but can be solved in polynomial time for queries admitting a particular evaluation plan (namely, 
safe plan). Basically, a safe plan is obtained by suitably pushing the projection in the query expression, in order to 
extend the validity of the independence assumption also to the partial results of the query. The results of [11] were 
extended in iflQjfl^ [Til I24I kill . Specifically, in 13, a technique was presented for computing safe plans on disjoint- 
independent databases (where only tuples belonging to different buckets are considered as independent). In [13] 
and [10], the dichotomy theorem of [11] was extended to deal with conjunctive queries with self-joins and unions 
of conjunctive queries, respectively. In [41], it was shown that a polynomial-time evaluation can be accomplished 
also with query plans with any join ordering (not only those orderings required by safe plans). Finally, in [24], a 
technique was presented enabling the determination of efficient query plans even for queries admitting no safe plan 
(this is allowed by looking at the database instance to decide the most suitable query plan, rather than looking only at 
the database schema). 

The problem of dealing with probabilistic data when correlations are not known (and independence may not 
be assumed) was addressed in [31]. Here, an algebra for querying probabilistic data was introduced, as well as a 
system called ProbView, which supports the evaluation of algebraic expressions by returning answers associated with 
probability intervals. However, the query evaluation is based on an extensional semantics and no integrity constraints 
encoding domain knowledge were considered. 

One of the first works investigating a suitable model for representing correlations among probabilistic data is [23], 
where probabilistic c-tables were introduced. In this framework, whose rationale is also at the basis of the PDB 
MayBMS [28], correlations are expressed by associating tuples with boolean formulas on random variables, whose 
probability functions are represented in a table. However, in this approach, only one interpretation for the database is 
considered (the one deriving from assuming the random variables independent from one another), and it is not suitable 
for simulating the presence of integrity constraints on the data when the marginal probabilities of the tuples are known. 
Similar differences, such as that of assuming only one interpretation, hold between our framework and that at the basis 
of Trio IHO]], where incomplete and probabilistic data are modeled by combining the possibility of specifying buckets 
of tuples with the association of each tuple with its lineage (expressed as the set of tuples from which each tuple 
derived). In particular, in [1] an extension of Trio is proposed which aims at better managing the epistemic uncertainty 
(i.e., the information about uncertainty is itself incomplete). Here, the semantics of generalized uncertain databases 
is given in terms of a Dempster-Shafer mass distribution over the powerset of the possible worlds (this collapses to the 
case of a PDB with one probability distribution, if the mass distribution is defined over every single possible world). 
Further approaches to representing rich correlations and querying the data are those in [43, 32, 26], where correlations 
among data are represented according to some graphical models (such as PGMs, junction trees, AND/XOR trees). 
In these approaches, correlations are detected while data are generated and, in some sense, they are data themselves: 
the database consists of a graph representing correlations among events, so that the marginal distributions of tuples 
are not explicitly represented, but derive from the correlations encoded in the graph. This is a strong difference with 
our framework, where a PDB is a set of tuples associated with their marginal probabilities, and constraints can be 
imposed by domain experts with no need of taking part to the data-acquisition process. Moreover, in ll43l SSI, 
independence is assumed between tuples for which a correlation is not represented in the graph of correlations. On the 
contrary, our query evaluation model relies on a "cautious" paradigm, where no assumption is made between tuples 
not explicitly correlated by the constraints. In [12], the problem of evaluating queries over probabilistic views under 
integrity constraints (functional and inclusion dependencies) and in the presence of statistics on the cardinality of the 
source relations was considered. In this setting, when evaluating query answers and their probabilities, all the possible 
values of the attribute values of the original relations must be taken into account, and this backs the use of the Open 
World Assumption (as the original relations may contain attribute values which do not occur in the views). Under 
this assumption, queries are evaluated over the interpretation of the data having the maximum entropy among all the 
possible models. 

All the above-cited works assume that the correlations represented among the data are consistent. In [29], the 
problem was addressed of querying a PDB when integrity constraints are considered a posteriori, thus some possible 
worlds having non-zero probability under the independence assumption may turn out to be inconsistent. In this 
scenario, queries are still evaluated on the unique interpretation entailed by the independence assumption, but the 
possible worlds are assigned the probabilities conditioned on the fact that what is entailed by the constraint is true. That 
is, in the presence of a constraint $\Gamma$, the probability $P(Q)$ of a query $Q$ is evaluated as $P(Q|\Gamma)$, which is the probability 
of $Q$ assuming that $\Gamma$ holds. This corresponds to evaluating queries by augmenting them with the constraints, thus it is 
a different way of interpreting the constraints and queries from the semantics adopted in our paper, where constraints 
are applied on the database. The same spirit as this approach is at the basis of [9], where specific forms of integrity 
constraints in the special case of probabilistic XML data are taken into account by considering a single interpretation, 
conditioned on the constraints. 
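
To contrast the conditioning semantics described above with the one adopted in this paper, the following sketch computes the conditioned probability of a query given a constraint by brute-force enumeration of possible worlds under tuple independence. It is purely illustrative: the function name, the encoding of constraints and queries as Python predicates over worlds, and the exponential enumeration are assumptions of this sketch, not the technique of the cited work.

```python
from itertools import product
from typing import Callable, Dict, FrozenSet

World = FrozenSet[str]


def conditioned_probability(
    tuple_prob: Dict[str, float],
    constraint: Callable[[World], bool],
    query: Callable[[World], bool],
) -> float:
    """P(Q | constraint) over independent tuples, by enumerating all worlds."""
    names = sorted(tuple_prob)
    p_constraint = 0.0
    p_both = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = frozenset(n for n, b in zip(names, bits) if b)
        weight = 1.0
        for n, b in zip(names, bits):
            weight *= tuple_prob[n] if b else 1.0 - tuple_prob[n]
        if constraint(world):
            p_constraint += weight
            if query(world):
                p_both += weight
    if p_constraint == 0.0:
        raise ValueError("the constraint has probability zero")
    return p_both / p_constraint
```

For two tuples with probability 1/2 each and a constraint forbidding their co-existence, the conditioned probability of the first tuple is 1/3, which no longer matches its stated marginal of 1/2. Under the semantics of this paper, by contrast, the marginals are preserved and only the correlations are left open.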

8. Conclusions and Future work 

We have addressed two fundamental problems dealing with PDBs in the presence of denial constraints: the con- 
sistency checking and the query evaluation problem. We have thoroughly studied the complexity of these problems, 
characterizing the general cases and pointing out several tractable cases. 

There exist a number of interesting directions for future work. First of all, the cautious querying paradigm will be 
extended to deal with further forms of constraints. This will allow for enriching the types of correlations which can be 
expressed among the data, and this may narrow the probability ranges associated with the answers (in fact, for queries 
involving tuples which are not involved in any denial constraint, the obtained probability ranges may be pretty large, 
and of limited interest for data analysis). 

Another interesting direction for future work is the identification of other tractable cases of the consistency check- 
ing and the query evaluation problems. As regards the consistency checking problem, we conjecture that polynomial- 
time strategies can be devised when the conflict hypergraph exhibits a limited degree of cyclicity (as a matter of fact, 
we have shown that this problem is feasible in linear time not only for hypertrees, but also for rings, which have 
limited cyclicity as well). A possible starting point is investigating the connection between the consistency checking 
problem (viewed as evaluating the (dual) lineage of the constraint query - see Remark 1) and the model checking 
problem of Boolean formulas. The connection between lineage evaluation and model checking has been well established 
mainly for the case of tuple-independent PDBs [40, 25]. In fact, in this setting, it has been shown that, as it 
happens for checking Boolean formulas, the probability of a lineage can be evaluated by compiling it into a Binary 
Decision Diagram (BDD) [36], and then suitably processing the diagram. Specifically, if the lineage (or, equivalently, 
the Boolean formula to be checked) L can be compiled into a particular case of BDDs (such as Read-Once or Ordered 
BDD), the lineage evaluation (as well as the formula verification) can be accomplished as the result of a traversal of 
the BDD, in time linear w.r.t. the diagram size. Hence, in all the cases where L can be compiled into a Read-Once or 
an Ordered BDD of polynomial size, L can be evaluated in polynomial time. One of the most general results about the 
compilability of Boolean formulas into Ordered BDDs was stated in [19], where it was shown that any CNF expression 
over n variables whose hypergraph of clauses has bounded treewidth (< k) admits an equivalent Ordered BDD of 
size $O(n^{k+1})$. Then, the point becomes devising a mechanism for exploiting an Ordered BDD equivalent to a Boolean 
formula $f$ to evaluate the probability of $f$, when neither independence nor precise correlations can be assumed among 
the terms of $f$. To the best of our knowledge, this topic has not been investigated yet, and we plan to address it in future 
work. If it turned out that, under no assumption on the way terms are correlated, the probability of formulas can be 
evaluated by traversing their equivalent Ordered BDDs, then the above-cited result of [19] would imply other tractable 
cases of our consistency checking problem. However, our results on hypertrees and rings would still be of definite 
interest, as we have found that in these cases the consistency checking problem can be solved in linear time, while 
the construction of the Ordered BDD is $O(n^{k+1})$. Moreover, our results show that the consistency checking problem 
over hypertrees and rings is still polynomially solvable (actually, in quadratic time) in the case that the cardinality 
of hyperedges is not known to be bounded by constants (see the discussion right after Theorem 2), which does not 
always correspond to structures having bounded treewidth. 
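
For reference, the linear-time traversal mentioned above for the tuple-independent case can be sketched as follows. This is the standard bottom-up evaluation of a BDD under independent variables, shown only to fix ideas: the `Node` class and function name are hypothetical, and the sketch does not address the open case discussed in this section, where neither independence nor precise correlations are assumed.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Node:
    var: str      # variable tested at this node
    low: "Bdd"    # sub-diagram followed when var is false
    high: "Bdd"   # sub-diagram followed when var is true


Bdd = Union[bool, Node]  # terminals are plain booleans


def bdd_probability(root: Bdd, var_prob: Dict[str, float]) -> float:
    """Probability that the formula encoded by the BDD is true,
    assuming the variables are mutually independent."""
    cache: Dict[int, float] = {}

    def visit(node: Bdd) -> float:
        if isinstance(node, bool):
            return 1.0 if node else 0.0
        key = id(node)
        if key not in cache:  # each shared node is evaluated once
            p = var_prob[node.var]
            cache[key] = (1.0 - p) * visit(node.low) + p * visit(node.high)
        return cache[key]

    return visit(root)
```

Thanks to the cache, every node is processed once, so the running time is linear in the size of the diagram. The recurrence relies on independence when it multiplies by the marginal of the tested variable, which is exactly the step that has no direct counterpart when correlations are unknown.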

Finally, our framework can be exploited to address the problem of repairing data and extracting reliable informa- 
tion from inconsistent PDBs. This research direction is somehow related to [3], where the evaluation of clean answers 
over deterministic databases which are inconsistent due to the presence of duplicates is accomplished by encoding 
the inconsistent database into a PDB adopting the bucket independent model. Basically, in this PDB, probabilities are 
assigned to tuples representing variants of the same tuple, and these variants are grouped in buckets. However, the so 
obtained PDB is consistent, thus this approach is not a repairing framework for inconsistent PDBs, but is a technique 
for getting clean answers over inconsistent deterministic databases after rewriting queries into "equivalent" queries 
over the corresponding consistent PDBs. A more general repairing problem in the probabilistic setting has been re- 
cently addressed in [33], where a strategy based on deleting tuples has been proposed, "inspired" by the common 
approaches for inconsistent deterministic databases [6]. We envision a different repairing paradigm, which addresses 
a source of inconsistency which is typical of the probabilistic setting: inconsistencies may arise from wrong assign- 
ments to the marginal probabilities of tuples, due to limitations of the model adopted for encoding uncertain data into 
probabilistic tuples. In this perspective, a repairing strategy based on properly updating the probabilities of the tuples 
(possibly by adapting frameworks for data repairing in the deterministic setting based on attribute updates) 
seems to be the most suitable choice. 

Appendix A. Proofs 

In this appendix we report the proofs of the theorems whose statements have been provided and commented in the 
main body of the paper. Furthermore, the appendix contains some new lemmas which are exploited in these proofs. 

Appendix A.1. Proofs of Theorem 1, Proposition 1, and Lemma 1 

Theorem 1 (Complexity of cc). cc is NP-complete. 

Proof. The membership of cc in NP has been already proved in the core of the paper, where a reduction from cc to 
PSAT has been described. As regards the hardness, it follows from Theorem 7 (or, equivalently, from Theorem 5), 
whose proof is given in Appendix A.3. □ 

We now report a property of γ-acyclic hypergraphs from [15], which will be used in the proof of Proposition 1. 

Fact 1 ([15]). Let $H = (N, E)$ be a hypertree. There exists at least one hyperedge $e \in E$ such that at least one of the 
following conditions holds: 

1. $e \cap N(H^{-\{e\}})$ is a set of edge equivalent nodes; 

2. there exists $e' \in E$ such that $e' \neq e$ and $e \cap N(H^{-\{e,e'\}}) = e' \cap N(H^{-\{e,e'\}})$. 

Moreover, $H^{-\{e\}}$ is still a hypertree. 

Proposition 1. Let $H = (N, E)$ be a hypertree. Then, there is at least one hyperedge $e \in E$ such that $Int(e, H)$ is a 
matryoshka. Moreover, $H^{-\{e\}}$ is still a hypertree. 

Proof. Reasoning by induction on the number of hyperedges in $E$, we prove that there is a total ordering $e_1, \dots, e_n$ of 
the edges in $E$ such that all the following conditions hold for each $i \in [1..n]$: 

1. either $e_i \cap N(H^{-\{e_1,\dots,e_i\}})$ is a set of edge equivalent nodes, or there exists $e' \in E(H^{-\{e_1,\dots,e_{i-1}\}})$ such that $e' \neq e_i$ 
and $e_i \cap N(H^{-\{e_1,\dots,e_i,e'\}}) = e' \cap N(H^{-\{e_1,\dots,e_i,e'\}})$; 

2. $H^{-\{e_1,\dots,e_i\}}$ is a hypertree; 

3. $Int(e_i, H^{-\{e_1,\dots,e_{i-1}\}})$ is a matryoshka. 

The base case ($|E| = 1$) is straightforward. In order to prove the induction step, we reason as follows. Since $H$ is 
a hypertree, Fact 1 implies that there is an edge $e$ such that 1) either $e \cap N(H^{-\{e\}})$ is a set of edge equivalent nodes, or 
there exists $e' \in E$ such that $e' \neq e$ and $e \cap N(H^{-\{e,e'\}}) = e' \cap N(H^{-\{e,e'\}})$, and 2) $H^{-\{e\}}$ is a hypertree. 

From the inductive hypothesis, since $H^{-\{e\}}$ is a hypertree, there exists a total ordering $e_1, \dots, e_{n-1}$ of the edges in 
$E - \{e\}$ such that for each $i \in [1..n-1]$ conditions 1, 2 and 3 are satisfied w.r.t. $H^{-\{e\}}$. 

If $Int(e, H)$ is a matryoshka, then the total ordering $e, e_1, \dots, e_{n-1}$ of the edges in $E$ satisfies conditions 1, 2 and 3 
for every edge in the sequence, thus the statement is proved in this case. 

Otherwise, since $Int(e, H)$ is not a matryoshka, $e \cap N(H^{-\{e\}})$ is not a set of edge equivalent nodes. Hence, since 
$e$ satisfies the conditions of Fact 1, there exists $e_j \in \{e_1, \dots, e_{n-1}\}$ such that $e \cap N(H^{-\{e,e_j\}}) = e_j \cap N(H^{-\{e,e_j\}})$. 

We now consider separately the following two cases: 

Case 1): there is $k \in [1..j-1]$ such that $e_k \cap N(H^{-\{e,e_1,\dots,e_k,e_j\}}) = e_j \cap N(H^{-\{e,e_1,\dots,e_k,e_j\}})$. 

Case 2): there is no $k \in [1..j-1]$ such that $e_k \cap N(H^{-\{e,e_1,\dots,e_k,e_j\}}) = e_j \cap N(H^{-\{e,e_1,\dots,e_k,e_j\}})$. 

We first prove Case 1). Let $k \in [1..j-1]$ be the smallest index such that $e_k \cap N(H^{-\{e,e_1,\dots,e_k,e_j\}}) = e_j \cap N(H^{-\{e,e_1,\dots,e_k,e_j\}})$. 
We consider the total ordering of the edges of $E$ obtained by inserting $e$ immediately before $e_k$ in $e_1, \dots, e_{n-1}$, i.e., 
$e_1, \dots, e_{k-1}, e, e_k, \dots, e_j, \dots, e_{n-1}$. 

We first prove that for each $i \in [1..k-1]$ conditions 1, 2 and 3 still hold. For each $i \in [1..k-1]$ one of the following 
cases occurs: 

• $e_i \cap e_j = \emptyset$. In this case, since $e \cap N(H^{-\{e,e_j\}}) = e_j \cap N(H^{-\{e,e_j\}})$, it is straightforward to see that conditions 1, 2 
and 3 hold. 

• e t n ej + and e,- n }J{H~ {e ' eu "'' e ^) is a set of edge equivalent nodes. Since e n N(H^ e - e ' } ) = ej n N(H~ {e ' e i ] ), 
ei n ej + and ey is an edge of H^ e - ei '- - e >- l] then e t n N(H- {e ' e ^ ^ e - l] ) = e,- n N(H~ {ei '- Therefore, the 
nodes in e, n N(H^ ei, "' ,e '-^) are edge equivalent w.r.t /7 - tei>'"> e i-» too. Hence, conditions 1,2 and 3 hold. 

• eiDej + and there is an h e [i+l..n - 1], with/; + j, such that e,- n N(H~ {e ' ei <' ' e " e » ] ) = e h n N(H~ [e - eu ••• e '- e »'). 
Since, e ; - is and edge of and e n N(H- [e - e J ] ) = ej n N(H- {e ' e ^) it holds that e, n = 
e, n #(#-{«!"• . e i. e *J) = e/l n N(H- te, --' ei ' e " ) ). Hence conditions 1,2 and 3 hold in this case too. 

Observe that, in the last two cases mentioned above the fact that 7«f(e,, H^ el is a matryoshka follows from the 
fact that e t n N(H~ {el - "• e '-i>) = e,- n N(H~ [e - el - and e,- n e = e/ n e;. Moreover, conditions 1, 2 and 3 still hold for 
each i e [k..n - 1] since they are not changed w.r.t. the inductive hypothesis. 

As regards the edge $e$, it is easy to see that conditions 1 and 2 are satisfied since $e_j$ appears after $e$ in the total 
ordering $e_1, \dots, e_{k-1}, e, e_k, \dots, e_j, \dots, e_{n-1}$. 

We now prove that condition 3 holds for e. We know from the induction hypothesis that Int(ek,H^ e,ew "' ek - 1 ') is a 
matryoshka. However, since e n N(H~ le ' e i ] ) = ejr\N(H- le ' e i ] ) and j > k then Int{e k ,H {e ' e> ' -• <? '-' 1 ) = Int{e k ,H {ei > - > e <>- i] ). 
Since, e k n N(H- {e ' e ''-' e "- e J ] ) = ej n N(H- {e ' e '' ~^ e A) and e n N{H- [e ' e > ] ) = e ; - n N(H- {e - e J ] ) it holds that 

einAr(ff- (e ' ei '-' et ^ } ) = e ; niV(H" {e,ei '"'- es ^ } ) = e niV(#" {e ' e ""' ,e * ,e -' } ). 

Therefore the set of nodes in e n N(H~^ e1, '■ et - 1 l) can be partitioned in three sets N,N',N" such that: 

- iV = niV(H-{^.- ^)) = UsEtoto.H'-,--^,^, 

- iV' = e k n ey - N, and 

- iV" = e n e ; - - N' -N. 

Hence, it is easy to see that/ref(e, = /«f( ejt , '■' e *-' , )U{MJAnu{NUN'UJV"}. Therefore, 7nf(e, 

is a matryoshka. Hence, the proof for Case 1) is completed. 

We now prove Case 2). We consider the total ordering of the edges of $E$ obtained by inserting $e$ immediately 
before $e_j$ in $e_1, \dots, e_{n-1}$, i.e., $e_1, \dots, e_{j-1}, e, e_j, \dots, e_{n-1}$. It is easy to see that we can prove that for each $i \in [1..j-1]$ 
conditions 1, 2 and 3 are satisfied by applying the same reasoning used in Case 1) to prove that for each 
$i \in [1..k-1]$ conditions 1, 2 and 3 hold. Analogously to the proof of Case 1), it is straightforward to see that conditions 
1, 2 and 3 still hold for each $i \in [j..n-1]$ since they are not changed w.r.t. the inductive hypothesis. 

As regards the edge $e$, it is easy to see that conditions 1 and 2 are satisfied since $e_j$ appears after $e$ in the total 
ordering $e_1, \dots, e_{j-1}, e, e_j, \dots, e_{n-1}$. 

To complete the proof we show that condition 3 holds for e in this case. From the induction hypothesis, we know 
that it is the case that Int(ej,H {e > eu ' > e j- l] ) is a matryoshka. However, since e n N(H~ {e ' e ' ] ) = ej n N(H~ le ' e i ] ) then 
e n N(H~ {e ' ew "' e i>) = ej n N(H' {e ' ew "' e i } ), and it holds that the set of nodes in e n N{H~ {eu '" < ek - l] ) can be partitioned in 
the sets and N' such that: 

~N = er-M/l :■•"••"•) = U.m^,//^- -,-■>) S, 

- N' = e k n ej - N. 

It is easy to see that the following holds Int(e, H [eu ~ • e '-' ') = Int(ej, H {e - ei > - ' e "-' } )U{NL)N' }. Therefore, Int(e, H {ei ' " ' e >- ,] ) 
is a matryoshka, which completes the proof for Case 2) and the proof of the proposition. □ 

Before providing the proof of Lemma 1, we report a well-known result on the minimum and maximum probability 
of the conjunction of events among which no correlation is known, taken from 

Fact 2. Let $E_1, E_2$ be a pair of events such that their marginal probabilities $p(E_1)$, $p(E_2)$ are known, while no 
correlation among them is known. Then, the minimum and maximum probabilities of the event $E_1 \wedge E_2$ are as follows: 
$p^{min}(E_1 \wedge E_2) = \max\{0,\, p(E_1) + p(E_2) - 1\}$; and $p^{max}(E_1 \wedge E_2) = \min\{p(E_1),\, p(E_2)\}$. 

The formulas reported above are also known as Fréchet-Hoeffding formulas. In Lemma 1 we generalize the formula 
for the minimum probability, and adapt it to our database setting. 
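
As a quick numeric illustration of the two bounds for a pair of events, consider the following sketch (the function name is a hypothetical choice).

```python
from typing import Tuple


def frechet_bounds(p1: float, p2: float) -> Tuple[float, float]:
    """Minimum and maximum probability of the conjunction of two events
    whose marginals are p1 and p2 and whose correlation is unknown."""
    lower = max(0.0, p1 + p2 - 1.0)
    upper = min(p1, p2)
    return lower, upper
```

For instance, two events with marginals 0.7 and 0.6 must overlap on at least 0.3 of the probability space and can overlap on at most 0.6 of it, whereas independence would single out the intermediate value 0.42.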

Lemma 1. Let $D^p$ be an instance of $\mathcal{D}^p$ consistent w.r.t. $IC$, $T$ a set of tuples of $D^p$, and $H = HG(D^p, IC)$. If either i) 
the tuples in $T$ are pairwise disconnected in $H$, or ii) $Int(T, H)$ is a matryoshka, then $p^{min}(T) = \max\{0,\, \sum_{t \in T} p(t) - |T| + 1\}$. 
Otherwise, this formula provides a lower bound for $p^{min}(T)$. 

Proof. Case i): In the case that $t_1, \dots, t_n$ are pairwise disconnected in the conflict hypergraph, the formula for 
$p^{min}(t_1, \dots, t_n)$ can be proved by induction on $n$, considering as base case the formula for the minimum probability 
of a pair of events reported in Fact 2. 
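
One way to read the induction of Case i) is that applying the pairwise lower bound of Fact 2 one tuple at a time yields the closed formula of the lemma. The sketch below checks this numerically; it is an illustration of the arithmetic only, not a substitute for the proof, and both function names are hypothetical.

```python
from typing import Sequence


def min_conjunction_closed_form(probs: Sequence[float]) -> float:
    """Closed formula: max{0, sum of marginals - number of tuples + 1}."""
    if not probs:
        raise ValueError("at least one tuple is required")
    return max(0.0, sum(probs) - len(probs) + 1)


def min_conjunction_by_folding(probs: Sequence[float]) -> float:
    """Apply the pairwise lower bound of Fact 2 one tuple at a time."""
    if not probs:
        raise ValueError("at least one tuple is required")
    bound = probs[0]
    for p in probs[1:]:
        bound = max(0.0, bound + p - 1.0)
    return bound
```

The two functions agree because, once the running bound is clamped at 0, adding further terms of the form p - 1 (which are never positive) keeps it at 0, exactly as in the closed formula.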

Case ii): We prove an equivalent formulation of the statement over the same instance of D p : "Let T be a set of nodes 
of H — HG(D P , IC) such that Int(T, H) is a matryoshka. Let T" — t\, . . . ,t n be a sequence consisting of the nodes 
of T ordered as follows: i > j => s(f,) 2 s(tj), where s(ti) is the maximal set in Int(T,H) containing f,-. Then, 
p mn (ti ,...,?„) = max Jo, Y!i=\ P(td - n + 1 } ". That is, we consider the nodes in T suitably ordered, as this will help 
us to reason inductively. 

We reason by induction on the length of the sequence $T^n$. The base case ($n = 1$) trivially holds, as, for any tuple $t$,
$p^{min}(t) = p(t)$. We now prove the induction step: we assume that the property holds for any sequence of the considered
form of length $n - 1$, and prove that this implies that the property holds for sequences of $n$ nodes.

From the induction hypothesis, we have that the property holds for the subsequence $T^{n-1} = t_1, \dots, t_{n-1}$ of $T^n$. That
is, there is a model $M$ for $D^p$ w.r.t. $IC$ such that $\sum_{w \supseteq \{t_1, \dots, t_{n-1}\}} M(w) = \max\{0, \sum_{i=1}^{n-1} p(t_i) - (n - 1) + 1\}$. We show
how, starting from $M$, a model $M'$ can be constructed such that $\sum_{w \supseteq \{t_1, \dots, t_n\}} M'(w) = \max\{0, \sum_{i=1}^{n} p(t_i) - n + 1\}$, which is
the formula reported in the statement for $p^{min}(t_1, \dots, t_n)$. According to $M$, the set of possible worlds of $D^p$ can be
partitioned into:

• $W(t_1 \wedge \dots \wedge t_{n-1} \wedge t_n)$: the set of possible worlds containing all the tuples $t_1, \dots, t_{n-1}, t_n$;

• $W(\neg(t_1 \wedge \dots \wedge t_{n-1}) \wedge t_n)$: the set of possible worlds containing $t_n$, but not containing at least one among
$t_1, \dots, t_{n-1}$;

• $W(t_1 \wedge \dots \wedge t_{n-1} \wedge \neg t_n)$: the set of possible worlds containing all the tuples $t_1, \dots, t_{n-1}$, but not containing $t_n$;

• $W(\neg(t_1 \wedge \dots \wedge t_{n-1}) \wedge \neg t_n)$: the set of possible worlds not containing $t_n$ and not containing at least one tuple
among $t_1, \dots, t_{n-1}$.

For the sake of brevity, the sets of worlds defined above will be denoted as $W$, $W'$, $W''$, $W'''$, respectively. In
the following, given a set of possible worlds $\mathcal{W}$, we denote as $M(\mathcal{W})$ the overall probability assigned by $M$ to
the worlds in $\mathcal{W}$, i.e., $M(\mathcal{W}) = \sum_{w \in \mathcal{W}} M(w)$. Thus, if $M(W) = \max\{0, \sum_{i=1}^{n} p(t_i) - n + 1\}$, then we are done,
since the right-hand side of this formula is the expression for $p^{min}(t_1, \dots, t_n)$ given in the statement, and it is in
every case a lower bound for $p^{min}(t_1, \dots, t_n)$ (in fact, $p^{min}(t_1, \dots, t_n)$ cannot be less than in the case that the tuples
are pairwise disconnected in $H$). Otherwise, it must be the case that $M(W) > \max\{0, \sum_{i=1}^{n} p(t_i) - n + 1\}$. Assume
that $\sum_{i=1}^{n} p(t_i) - n + 1 > 0$ (the case that $\max\{0, \sum_{i=1}^{n} p(t_i) - n + 1\} = 0$ can be proved similarly). Hence, we are
in the case that $M(W) = \sum_{i=1}^{n} p(t_i) - n + 1 + \epsilon > 0$, with $\epsilon > 0$. Since $M(W') = p(t_n) - M(W)$, this means that
$M(W') = p(t_n) - [\sum_{i=1}^{n} p(t_i) - n + 1 + \epsilon] = -\sum_{i=1}^{n-1} p(t_i) + (n - 1) - \epsilon$. From the induction hypothesis, the term
$-\sum_{i=1}^{n-1} p(t_i) + (n - 1)$ is equal to $1 - p^{min}(t_1, \dots, t_{n-1})$, thus we have: $M(W') = 1 - p^{min}(t_1, \dots, t_{n-1}) - \epsilon$. Since
$p^{min}(t_1, \dots, t_{n-1})$ is exactly the overall probability, according to $M$, of the possible worlds containing all the tuples
$t_1, \dots, t_{n-1}$, we have that $1 - p^{min}(t_1, \dots, t_{n-1}) = M(W') + M(W''')$, thus we obtain: $M(W') = M(W') + M(W''') - \epsilon$.

This means that $M(W''') = \epsilon$, where $\epsilon > 0$. That is, the overall probability of the possible worlds in $W'''$ is equal to
the difference $\epsilon$ between $M(W)$ and the value $\sum_{i=1}^{n} p(t_i) - n + 1$ that we want to obtain for the cumulative probability
of the worlds in $W$. We now show how $M$ can be modified in order to obtain a model $M'$ such that $M'(W)$ is exactly
this value. We construct $M'$ as follows. Let $w'''_1, \dots, w'''_k$ be the possible worlds in $W'''$ such that $M(w'''_i) > 0$, for
each $i \in [1..k]$. Take $k$ values $\epsilon_1, \dots, \epsilon_k$, where each $\epsilon_i$ is equal to $M(w'''_i)$. Hence $\sum_{i=1}^{k} \epsilon_i = \epsilon$. Then, for each
$i \in [1..k]$, let $M'(w'''_i) = M(w'''_i) - \epsilon_i = 0$, and, for each $w''' \in W''' \setminus \{w'''_1, \dots, w'''_k\}$, $M'(w''') = 0$. This way,
$M'(W''') = \sum_{w''' \in W'''} M'(w''') = M(W''') - \epsilon = 0$. For each $w'''_i$ (with $i \in [1..k]$), let $w'_i$ be the possible world in $W'$
"corresponding" to $w'''_i$: that is, $w'_i$ is the possible world $w'''_i \cup \{t_n\}$. Then, for each $i \in [1..k]$, let $M'(w'_i) = M(w'_i) + \epsilon_i$,
and, for each $w' \in W' \setminus \{w'_1, \dots, w'_k\}$, $M'(w') = M(w')$. This way, $M'(W') = \sum_{w' \in W'} M'(w') = M(W') + \epsilon$. Basically,
we are constructing the model $M'$ by "moving" $\epsilon$ from the overall probability assigned by $M$ to the worlds of $W'''$
towards the worlds of $W'$. Observe that every world $w'_i \in W'$ such that $M'(w'_i) > 0$ is consistent w.r.t. $IC$, for the
following reason. If $M'(w'_i) = M(w'_i)$, the property derives from the fact that $M$ is a model. Otherwise, we are in the
case that $w'_i = w'''_i \cup \{t_n\}$, where $M(w'''_i) > 0$. Since $M$ is a model, $M(w'''_i) > 0$ implies that $w'''_i$ is consistent w.r.t. $IC$.
Then, adding $t_n$ to $w'''_i$ to obtain $w'_i$ has no impact on the consistency: $w'_i$ does not contain at least one tuple among
$t_1, \dots, t_{n-1}$, and, from the fact that any hyperedge of $HG(D^p, IC)$ containing $t_n$ contains all the tuples $t_1, \dots, t_{n-1}$, no
constraint encoded by the hyperedges containing $t_n$ is fired in $w'_i$.

It is easy to see that the strategy that we used to move $\epsilon$ from the overall probability of $W'''$ to $W'$ does not change
the overall probabilities assigned to the tuples different from $t_n$ in the worlds in $W' \cup W'''$, but it changes the overall
probability assigned to tuple $t_n$ in the same worlds, as it is increased by $\epsilon$. Hence, to adjust this, we perform an
analogous reasoning to "move" $\epsilon$ from the overall probability $M(W)$ (which is at least $\epsilon$ and whose worlds contain
$t_n$) to the overall probability assigned to $W''$ (which contains the same worlds of $W$ deprived of $t_n$). Thus, we define
$M'$ by "moving" portions of $\epsilon$ from the worlds of $W$ to the corresponding worlds of $W''$ (where the corresponding
worlds are those having the same tuples except for $t_n$), analogously to what was done before from the worlds of $W'''$ to
those of $W'$. This way, we obtain that $M'(W) = M(W) - \epsilon$ and $M'(W'') = M(W'') + \epsilon$. Also in this case, $M'$ does not
assign a non-zero probability to inconsistent worlds of $W''$: for any $w''_i$ such that $M'(w''_i) > M(w''_i)$, it is the case that
$M(w_i) > 0$ (where $w_i = w''_i \cup \{t_n\}$), which means that $w_i$ is consistent, and thus $w''_i$ (which results from removing a
tuple from $w_i$) must be consistent as well (removing a tuple cannot fire any denial constraint). Finally, observe that
this strategy for moving $\epsilon$ from the cumulative probability of $W$ to $W''$ does not alter the marginal probabilities of the
tuples different from $t_n$ in these worlds.

Therefore, $M'$ is a model for $D^p$ w.r.t. $IC$ which assigns to $W$ a cumulative probability equal to $M'(W) =
M(W) - \epsilon = \sum_{i=1}^{n} p(t_i) - n + 1$, which ends the proof. □

Appendix A.2. Proof of Theorem 3

In order to prove Theorem 3, we exploit a property that holds for particular conflict hypergraphs, called chains.
Basically, a chain is the hypergraph resulting from removing a hyperedge from a ring. Thus, a chain consists of a
sequence of hyperedges $e_1, \dots, e_n$ where all and only the pairs of consecutive hyperedges have non-empty intersection
(differently from the ring, $e_1 \cap e_n = \emptyset$).

Given a chain $C = e_1, \dots, e_n$, we say that $n$ is its length, and denote it with $length(C)$. Moreover, for each
$i \in [1..n - 1]$, we will use the symbol $\alpha_i$ to denote the intersection $e_i \cap e_{i+1}$ of consecutive hyperedges, and, for each
$i \in [1..n]$, we will use the symbol $\beta_i$ to denote $ears(e_i)$, and $\tilde{\beta}_i$ to denote a subset of $ears(e_i)$. Finally, $sub(C)$ will
denote the subsequence $e_2, \dots, e_{n-1}$ of the hyperedges in $C$.

In the following, given a set of tuples $X$, we will use the term "event $X$" to denote the event that all the tuples in
the set $X$ co-exist. Furthermore, $p^{min}_H(E)$ will denote the minimum probability of the event $E$ involving the tuples of
the database $D^p$ when the conflict hypergraph contains only the hyperedges in $H$.

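Lemma 3. Let $D^p$ be a PDB instance of $\mathcal{D}^p$ such that $D^p \models IC$. Assume that $HG(D^p, IC)$ is the chain $C = e_1, \dots, e_n$
(with $n > 1$). Moreover, let $\tilde{\beta}_1$, $\tilde{\beta}_n$ be subsets of the ears $\beta_1$, $\beta_n$ of $e_1$ and $e_n$, respectively. Then:

$p^{min}_C(\tilde{\beta}_1 \cup \tilde{\beta}_n) = \max\{0, p^{min}_{\emptyset}(\tilde{\beta}_1) + p^{min}_{\emptyset}(\tilde{\beta}_n) - [1 - p^{min}_{sub(C)}(\alpha_1 \cup (\beta_1 \setminus \tilde{\beta}_1) \cup \alpha_{n-1} \cup (\beta_n \setminus \tilde{\beta}_n))]\}$

where: $p^{min}_{sub(C)}(\alpha_1 \cup (\beta_1 \setminus \tilde{\beta}_1) \cup \alpha_{n-1} \cup (\beta_n \setminus \tilde{\beta}_n)) = \max\{0, p^{min}_{sub(C)}(\alpha_1 \cup \alpha_{n-1}) + p^{min}_{\emptyset}((\beta_1 \setminus \tilde{\beta}_1) \cup (\beta_n \setminus \tilde{\beta}_n)) - 1\}$ and, for any
set of tuples $\gamma$, $p^{min}_{\emptyset}(\gamma) = \max\{0, \sum_{t \in \gamma} p(t) - |\gamma| + 1\}$.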

Proof. $p(\tilde{\beta}_1 \cup \tilde{\beta}_n)$ can be minimized as follows.

1) We start from any model $M$ of $D^p$ minimizing the portion of the probability space where neither the event $\tilde{\beta}_1$ nor the
event $\tilde{\beta}_n$ can occur. That is, $M$ is any model minimizing the probability of the event $E = \alpha_1 \cup (\beta_1 \setminus \tilde{\beta}_1) \cup \alpha_{n-1} \cup (\beta_n \setminus \tilde{\beta}_n)$
(this event is mutually exclusive with both $\tilde{\beta}_1$ and $\tilde{\beta}_n$ due to hyperedges $e_1$ and $e_n$). It is easy to see that $M$ is also a
model for $D^p$ w.r.t. the conflict hypergraph $sub(C)$, and that the minimum probability $p^{min}_{sub(C)}(E)$ of $E$ w.r.t. $sub(C)$ is
equal to the minimum probability $p^{min}_C(E)$ of $E$ w.r.t. $C$. We denote this probability as $Y$.

2) We re-distribute the tuples in $\tilde{\beta}_1 \cup \tilde{\beta}_n$ over the portion of size $1 - Y$ of the probability space not assigned to $E$, so
that $p(\tilde{\beta}_1) = p^{min}(\tilde{\beta}_1)$ and $p(\tilde{\beta}_n) = p^{min}(\tilde{\beta}_n)$, and with the aim of minimizing the intersection of the events $\tilde{\beta}_1$ and $\tilde{\beta}_n$.
The fact that the events $\tilde{\beta}_1$ and $\tilde{\beta}_n$ can be simultaneously assigned their minimum probabilities $p^{min}(\tilde{\beta}_1)$ and $p^{min}(\tilde{\beta}_n)$,
respectively, derives from Lemma 1 and from the consistency of $D^p$ w.r.t. $C$. This yields a (possibly) new model $M'$
for $D^p$ w.r.t. the "original" chain $C$ where $p(\tilde{\beta}_1 \cup \tilde{\beta}_n) = \max\{0, p^{min}(\tilde{\beta}_1) + p^{min}(\tilde{\beta}_n) - [1 - Y]\}$. In fact, viewing the
available probability space as a segment of length $1 - Y$, this corresponds to assigning the left-most part of the segment
of length $p^{min}(\tilde{\beta}_1)$ to event $\tilde{\beta}_1$, and the right-most part of length $p^{min}(\tilde{\beta}_n)$ to event $\tilde{\beta}_n$. This way, the probability of the
intersection is the length of the segment portion (if any) assigned to both $\tilde{\beta}_1$ and $\tilde{\beta}_n$. In brief, we obtain the formula
reported in the statement for $p^{min}_C(\tilde{\beta}_1 \cup \tilde{\beta}_n)$.

The formula for $p^{min}(\alpha_1 \cup (\beta_1 \setminus \tilde{\beta}_1) \cup \alpha_{n-1} \cup (\beta_n \setminus \tilde{\beta}_n))$ can be proved with an analogous reasoning, while the formula
for $p^{min}_{\emptyset}(\gamma)$ follows from Lemma 1. □
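The segment argument used in step 2) can be checked numerically with a small Python sketch. It assumes both events fit in the available space (each length is at most the segment length); the function and parameter names are ours.

```python
def segment_overlap(available: float, left_len: float, right_len: float) -> float:
    """Place one event on the left-most part and the other on the right-most
    part of a segment of length `available`, and return the overlap length.
    Assumes 0 <= left_len, right_len <= available."""
    if not (0.0 <= left_len <= available and 0.0 <= right_len <= available):
        raise ValueError("each event must fit in the available segment")
    left = (0.0, left_len)                        # interval of the first event
    right = (available - right_len, available)    # interval of the second event
    return max(0.0, min(left[1], right[1]) - max(left[0], right[0]))
```

The returned value coincides with $\max\{0, a + b - L\}$ for lengths $a$, $b$ and segment length $L$, which is the shape of the formula in Lemma 3 with $L = 1 - Y$.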

Theorem 3. Given an instance $D^p$ of $\mathcal{D}^p$, if $HG(D^p, IC) = (N, E)$ is a ring, then $D^p \models IC$ iff both the following
hold: 1) $\forall e \in E$, $\sum_{t \in e} p(t) \le |e| - 1$; 2) $\sum_{t \in N} p(t) - |N| + \lceil |E|/2 \rceil \le 0$.
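The two conditions of Theorem 3 are easy to evaluate once the ring is given. The following Python sketch assumes the caller has already verified that the hypergraph is a ring (the function does not check this); the representation of hyperedges as sets of tuple identifiers and the name `ring_is_consistent` are our own choices.

```python
import math

def ring_is_consistent(hyperedges, p, tol=1e-12):
    """Evaluate Conditions 1 and 2 of Theorem 3.
    hyperedges: list of sets of tuple identifiers (assumed to form a ring);
    p: dict mapping each tuple identifier to its marginal probability."""
    nodes = set().union(*hyperedges)
    # Condition 1: in every hyperedge, the marginals sum to at most |e| - 1.
    cond1 = all(sum(p[t] for t in e) <= len(e) - 1 + tol for e in hyperedges)
    # Condition 2: sum of all marginals - |N| + ceil(|E| / 2) <= 0.
    cond2 = (sum(p[t] for t in nodes) - len(nodes)
             + math.ceil(len(hyperedges) / 2)) <= tol
    return cond1 and cond2
```

For a triangle of three binary hyperedges, tuples with probability 0.5 each satisfy Condition 1 but violate Condition 2, whereas probability 0.3 each satisfies both.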

Proof. In the following, we will denote the ring $HG(D^p, IC)$ as $\mathcal{R} = e_1, \dots, e_n, e_{n+1}$, and, for each $i \in [1..n + 1]$,
the ears of $e_i$ as $\varepsilon_i$, and, for each $i \in [1..n]$, the intersection $e_i \cap e_{i+1}$ as $\gamma_i$, and $e_1 \cap e_{n+1}$ as $\gamma_0$. Moreover, we will
denote as $C = e_1, \dots, e_n$ the chain obtained from ring $\mathcal{R}$ by removing the edge $e_{n+1}$. We now prove the left-to-right
and right-to-left implications separately.

($\Rightarrow$): We first show that, if $D^p \models IC$ and $HG(D^p, IC)$ is a ring, then both Condition 1 and Condition 2 hold. Condition 1
trivially follows from the fact that the proof of the left-to-right implication of Theorem 2 holds for general conflict
hypergraphs.

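We now focus on Condition 2. As $D^p$ is consistent w.r.t. $\mathcal{R}$, the presence of hyperedge $e_{n+1}$ in $HG(D^p, IC)$ implies
that the minimum probability that the tuples in $e_{n+1}$ co-exist is equal to 0. That is, $p^{min}_{\mathcal{R}}((\gamma_0 \cup \gamma_n) \cup \varepsilon_{n+1}) = 0$. On the
other hand, $p^{min}_C((\gamma_0 \cup \gamma_n) \cup \varepsilon_{n+1}) \le p^{min}_{\mathcal{R}}((\gamma_0 \cup \gamma_n) \cup \varepsilon_{n+1})$, thus it must hold that $p^{min}_C((\gamma_0 \cup \gamma_n) \cup \varepsilon_{n+1}) = 0$. Since,
according to the conflict hypergraph $C$, no correlation is imposed between the events $(\gamma_0 \cup \gamma_n)$ and $\varepsilon_{n+1}$, we also have
that $p^{min}_C((\gamma_0 \cup \gamma_n) \cup \varepsilon_{n+1}) = \max\{0, p^{min}_C(\gamma_0 \cup \gamma_n) + p^{min}_{\emptyset}(\varepsilon_{n+1}) - 1\}$ (see Fact 2). Hence, the following inequality must
hold:

$p^{min}_C(\gamma_0 \cup \gamma_n) + p^{min}_{\emptyset}(\varepsilon_{n+1}) - 1 \le 0$. (A.1)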

We now show that inequality (A.1) entails that Condition 2 holds. First, observe that $\gamma_0$ and $\gamma_n$ are subsets of the
ears of $e_1$ and $e_n$, respectively, w.r.t. the hypergraph $C$. Hence, since $C$ is a chain, we can apply Lemma 3 to obtain
$p^{min}_C(\gamma_0 \cup \gamma_n)$ as a function of $p^{min}_{sub(C)}(\gamma_1 \cup \gamma_{n-1})$. Thus, by recursively applying ($\lfloor n/2 \rfloor$ times) Lemma 3, we obtain the
following expression for $p^{min}_C(\gamma_0 \cup \gamma_n)$ (where $x = \lfloor n/2 \rfloor - 1$ and $y = \lceil n/2 \rceil + 1$):

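$$
\begin{aligned}
\max\Big\{0,\ & \max\{0, \textstyle\sum_{t \in \gamma_0} p(t) - |\gamma_0| + 1\} + \max\{0, \sum_{t \in \gamma_n} p(t) - |\gamma_n| + 1\} - 1\ + \\
& \max\Big\{0,\ \max\big\{0,\ \max\{0, \textstyle\sum_{t \in \gamma_1} p(t) - |\gamma_1| + 1\} + \max\{0, \sum_{t \in \gamma_{n-1}} p(t) - |\gamma_{n-1}| + 1\} - 1\ + \\
& \qquad \cdots\ \max\big\{0,\ \max\{0, \textstyle\sum_{t \in \gamma_x} p(t) - |\gamma_x| + 1\} + \max\{0, \sum_{t \in \gamma_y} p(t) - |\gamma_y| + 1\} - 1 + P\big\} + \cdots \\
& \qquad \max\{0, \textstyle\sum_{t \in (\varepsilon_2 \cup \varepsilon_{n-1})} p(t) - |\varepsilon_2 \cup \varepsilon_{n-1}| + 1\} - 1 \big\}\ + \\
& \quad \max\{0, \textstyle\sum_{t \in (\varepsilon_1 \cup \varepsilon_n)} p(t) - |\varepsilon_1 \cup \varepsilon_n| + 1\} - 1 \Big\} \Big\}
\end{aligned}
$$

where:

$$
P = \begin{cases} p^{min}_{\emptyset}(\gamma_{x+1}) & \text{if } n \text{ is even;} \\ p^{min}_{\emptyset}(\gamma_{x+1} \cup \gamma_{y-1}) & \text{otherwise.} \end{cases}
$$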

In this formula, $p^{min}_{\emptyset}(\gamma_{x+1}) = \max\{0, \sum_{t \in \gamma_{x+1}} p(t) - |\gamma_{x+1}| + 1\}$, and $p^{min}_{\emptyset}(\gamma_{x+1} \cup \gamma_{y-1}) = \max\{0,
\sum_{t \in (\gamma_{x+1} \cup \gamma_{y-1})} p(t) - |\gamma_{x+1} \cup \gamma_{y-1}| + 1\}$ (the latter follows from applying Lemma 1).

The value of $p^{min}_C(\gamma_0 \cup \gamma_n)$ is greater than or equal to the sum $S$ of the non-zero terms that occur in the expression
obtained so far, that is:

f 2 f E(iVW +I )M0-(l^|-|en + lD + | + l, 

if the length n of the chain C is even; 

Ww+oMO - (M-kf-il-tertaD+Lf J+ 1, 

if the length n of C is odd. 

The fact that $p^{min}_C(\gamma_0 \cup \gamma_n) \ge S$ straightforwardly follows from the fact that $S$ is obtained by summing also possibly
negative contributions of terms of the form $\sum_{t \in Z} p(t) - |Z| + 1$, which are not considered when evaluating
$p^{min}_C(\gamma_0 \cup \gamma_n)$, since invocations of the max function return non-negative values only.

As the number of edges in the ring $\mathcal{R}$ is $|E| = n + 1$, the value of $S$ is in every case greater than or equal to

$S' = \sum_{t \in (N \setminus \varepsilon_{n+1})} p(t) - (|N| - |\varepsilon_{n+1}|) + \lceil |E|/2 \rceil$.

In brief, we have obtained $S' \le S \le p^{min}_C(\gamma_0 \cup \gamma_n)$.

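Since $D^p \models IC$ implies that $p^{min}_C(\gamma_0 \cup \gamma_n) + p^{min}_{\emptyset}(\varepsilon_{n+1}) - 1 \le 0$ (equation (A.1)), we obtain $S' + p^{min}_{\emptyset}(\varepsilon_{n+1}) - 1 \le 0$.
By replacing $S'$ and $p^{min}_{\emptyset}(\varepsilon_{n+1})$ with the corresponding formulas, we obtain

$\sum_{t \in (N \setminus \varepsilon_{n+1})} p(t) - (|N| - |\varepsilon_{n+1}|) + \lceil |E|/2 \rceil + \sum_{t \in \varepsilon_{n+1}} p(t) - |\varepsilon_{n+1}| + 1 - 1 \le 0$,

that is, $\sum_{t \in N} p(t) - |N| + \lceil |E|/2 \rceil \le 0$.
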



(<=): We now prove the right-to-left implication, reasoning by contradiction. Assume that both Condition 1. and 2. 
hold, but $D^p$ is not consistent w.r.t. the conflict hypergraph $\mathcal{R}$. However, since $C$ is a hypertree and Condition 1 holds,
from Theorem 2 we have that $D^p$ is consistent w.r.t. the conflict hypergraph $C$. In particular, it must be the case that
$p^{min}_C(e_{n+1}) = p^{min}_C((\gamma_0 \cup \gamma_n) \cup \varepsilon_{n+1}) > 0$: otherwise, any model of $D^p$ w.r.t. $C$ assigning probability 0 to the event
$(\gamma_0 \cup \gamma_n) \cup \varepsilon_{n+1}$ would be also a model for $D^p$ w.r.t. $\mathcal{R}$, which is in contrast with the contradiction hypothesis.

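Since, according to the conflict hypergraph $C$, no correlation is imposed between the events $(\gamma_0 \cup \gamma_n)$ and $\varepsilon_{n+1}$,
we also have that $p^{min}_C((\gamma_0 \cup \gamma_n) \cup \varepsilon_{n+1}) = \max\{0, p^{min}_C(\gamma_0 \cup \gamma_n) + p^{min}_{\emptyset}(\varepsilon_{n+1}) - 1\}$ (see Fact 2). Hence, the following
inequality must hold:

$p^{min}_C(\gamma_0 \cup \gamma_n) + p^{min}_{\emptyset}(\varepsilon_{n+1}) - 1 > 0$ (A.2)

which also implies both $p^{min}_C(\gamma_0 \cup \gamma_n) > 0$ and $p^{min}_{\emptyset}(\varepsilon_{n+1}) > 0$ (as probability values are bounded by 1).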
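By applying Lemma 3, we obtain that $p^{min}_C(\gamma_0 \cup \gamma_n)$ is equal to

$\max\{0, p^{min}_{\emptyset}(\gamma_0) + p^{min}_{\emptyset}(\gamma_n) - 1 + \max\{0, p^{min}_{sub(C)}(\gamma_1 \cup \gamma_{n-1}) + p^{min}_{\emptyset}(\varepsilon_1 \cup \varepsilon_n) - 1\}\}$

As shown above, $p^{min}_C(\gamma_0 \cup \gamma_n) > 0$, thus the expression for $p^{min}_C(\gamma_0 \cup \gamma_n)$ can be simplified into:

$p^{min}_{\emptyset}(\gamma_0) + p^{min}_{\emptyset}(\gamma_n) - 1 + \max\{0, p^{min}_{sub(C)}(\gamma_1 \cup \gamma_{n-1}) + p^{min}_{\emptyset}(\varepsilon_1 \cup \varepsilon_n) - 1\}$

By replacing $p^{min}_C(\gamma_0 \cup \gamma_n)$ with this formula in equation (A.2), we obtain

$p^{min}_{\emptyset}(\gamma_0) + p^{min}_{\emptyset}(\gamma_n) + p^{min}_{\emptyset}(\varepsilon_{n+1}) - 2 + \max\{0, p^{min}_{sub(C)}(\gamma_1 \cup \gamma_{n-1}) + p^{min}_{\emptyset}(\varepsilon_1 \cup \varepsilon_n) - 1\} > 0$ (A.3)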

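Since $p^{min}_{\emptyset}(\gamma_0) + p^{min}_{\emptyset}(\gamma_n) + p^{min}_{\emptyset}(\varepsilon_{n+1}) - 2 \le p^{min}_{\emptyset}(\gamma_0 \cup \gamma_n \cup \varepsilon_{n+1})$ (which follows from applying Fact 2 twice), and
$p^{min}_{\emptyset}(\gamma_0 \cup \gamma_n \cup \varepsilon_{n+1}) = \max\{0, \sum_{t \in (\gamma_0 \cup \gamma_n \cup \varepsilon_{n+1})} p(t) - |\gamma_0 \cup \gamma_n \cup \varepsilon_{n+1}| + 1\}$, and $\sum_{t \in (\gamma_0 \cup \gamma_n \cup \varepsilon_{n+1})} p(t) - |\gamma_0 \cup \gamma_n \cup \varepsilon_{n+1}| + 1 \le 0$
(Condition 1 over hyperedge $e_{n+1}$), we obtain that $p^{min}_{\emptyset}(\gamma_0) + p^{min}_{\emptyset}(\gamma_n) + p^{min}_{\emptyset}(\varepsilon_{n+1}) - 2 \le 0$. Hence, the second argument
of max in equation (A.3) must be strictly positive, thus equation (A.3) can be rewritten as:

$p^{min}_{\emptyset}(\gamma_0) + p^{min}_{\emptyset}(\gamma_n) + p^{min}_{\emptyset}(\varepsilon_{n+1}) - 2 + p^{min}_{sub(C)}(\gamma_1 \cup \gamma_{n-1}) + p^{min}_{\emptyset}(\varepsilon_1 \cup \varepsilon_n) - 1 > 0$ (A.4)

where $p^{min}_{sub(C)}(\gamma_1 \cup \gamma_{n-1}) > 0$ and $p^{min}_{\emptyset}(\varepsilon_1 \cup \varepsilon_n) > 0$ (otherwise, the second argument of max in equation (A.3) could
not be strictly positive, probability values being bounded by 1).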

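Observe that all the terms of the form $p^{min}$ occurring in (A.4) are strictly positive. In fact, we have already shown
that this holds for $p^{min}_{\emptyset}(\varepsilon_{n+1})$, $p^{min}_{sub(C)}(\gamma_1 \cup \gamma_{n-1})$, and $p^{min}_{\emptyset}(\varepsilon_1 \cup \varepsilon_n)$. As regards $p^{min}_{\emptyset}(\gamma_0)$, the fact that it is strictly
greater than 0 derives from $p^{min}_{\emptyset}(\gamma_0) = p^{min}_C(\gamma_0)$ (which is due to Lemma 1, as $\gamma_0$ is a matryoshka w.r.t. $C$), and
$p^{min}_C(\gamma_0) \ge p^{min}_C(\gamma_0 \cup \gamma_n)$, where $p^{min}_C(\gamma_0 \cup \gamma_n) > 0$, as shown before. The same reasoning suffices to prove that
$p^{min}_{\emptyset}(\gamma_n) > 0$.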

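The fact that all the terms of the form $p^{min}$ in (A.4) are strictly positive implies that we can replace them with the
corresponding formulas given in Lemma 1, simplified by eliminating the max operator. Therefore, we obtain:

$(\sum_{t \in \gamma_0} p(t) - |\gamma_0| + 1) + (\sum_{t \in \gamma_n} p(t) - |\gamma_n| + 1) + (\sum_{t \in \varepsilon_{n+1}} p(t) - |\varepsilon_{n+1}| + 1) + p^{min}_{sub(C)}(\gamma_1 \cup \gamma_{n-1})$
$+ (\sum_{t \in \varepsilon_1} p(t) - |\varepsilon_1| + 1) + (\sum_{t \in \varepsilon_n} p(t) - |\varepsilon_n| + 1) - 1 - 3 > 0$ (A.5)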



By recursively applying the same reasoning on $p^{min}_{sub(C)}(\gamma_1 \cup \gamma_{n-1})$ a number of times equal to $\lfloor n/2 \rfloor$, the term on the
left-hand side of equation (A.5) can be shown to be less than or equal to $\sum_{t \in N} p(t) - |N| + \lceil |E|/2 \rceil$ (depending on whether
$n$ is even or not, analogously to the proof of the inverse implication). Thus, we obtain $\sum_{t \in N} p(t) - |N| + \lceil |E|/2 \rceil > 0$,
which contradicts Condition 2. □

Appendix A.3. Proofs of Theorems 4, 5, 6 and 8

Theorem 4. If $IC$ consists of a join-free denial constraint, then cc is in PTIME. In particular, $D^p \models IC$ iff, for each
hyperedge $e$ of $HG(D^p, IC)$, it holds that $\sum_{t \in e} p(t) \le |e| - 1$.

Proof. Let $IC$ consist of the denial constraint $ic$ having the form: $\neg[R_1(\vec{x}_1) \wedge \dots \wedge R_m(\vec{x}_m) \wedge \varphi_1(\vec{x}_1) \wedge \dots \wedge \varphi_m(\vec{x}_m)]$, where
no variable occurs in two distinct relation atoms of $ic$, and, for each built-in predicate occurring in $\varphi_1(\vec{x}_1) \wedge \dots \wedge \varphi_m(\vec{x}_m)$,
at least one term is a constant. Given an instance $D^p$ of $\mathcal{D}^p$, we show that $D^p \models IC$ iff for each hyperedge $e$ of
$HG(D^p, IC)$, it holds that $\sum_{t \in e} p(t) \le |e| - 1$.

($\Rightarrow$): It straightforwardly follows from the fact that, as pointed out in the core of the paper after Theorem 2, the
condition that, for each hyperedge $e$ of $HG(D^p, IC)$, $\sum_{t \in e} p(t) \le |e| - 1$ is a necessary condition for the consistency in
the presence of any conflict hypergraph.

($\Leftarrow$): For each $i \in [1..m]$, let $R_i^{\varphi_i}$ be the maximal set of tuples in the instance of $R_i$ such that every tuple $t_i \in R_i^{\varphi_i}$
satisfies $R_i(\vec{x}_i) \wedge \varphi_i(\vec{x}_i)$.

It is easy to see that $HG(D^p, IC)$ consists of the set of hyperedges $\{\{t_1, \dots, t_m\} \mid \forall i \in [1..m],\ t_i \in R_i^{\varphi_i}\}$. Observe that
not all the hyperedges in $HG(D^p, IC)$ have size $m$, as the same relation scheme may appear several times in $ic$. That
is, in the case that there are $i, j \in [1..m]$ with $i < j$ such that $R_i^{\varphi_i} \cap R_j^{\varphi_j} \ne \emptyset$, the tuples $t_i$ and $t_j$ occurring in the same
hyperedge $\{t_1, \dots, t_i, \dots, t_j, \dots, t_m\}$ may coincide, thus this hyperedge has size less than $m$.
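The structure of the conflict hypergraph described above suggests a direct way of evaluating the condition of Theorem 4. The Python sketch below is only illustrative: it assumes that, for each relation atom, the list of tuples satisfying that atom and its built-in predicates has already been computed (argument `candidates`), and it enumerates hyperedges explicitly, which is polynomial only because the number $m$ of atoms is fixed by the constraint. All names are ours.

```python
from itertools import product

def join_free_consistent(candidates, p, tol=1e-12):
    """candidates: list of m lists; candidates[i] holds the identifiers of the
    tuples satisfying the i-th relation atom together with its built-in
    predicates. p: dict from tuple identifier to marginal probability.
    Returns True iff every hyperedge e satisfies sum_{t in e} p(t) <= |e| - 1."""
    # A hyperedge picks one tuple per atom; using a set makes coinciding
    # tuples collapse, so some hyperedges have size smaller than m.
    hyperedges = {frozenset(choice) for choice in product(*candidates)}
    return all(sum(p[t] for t in e) <= len(e) - 1 + tol for e in hyperedges)
```

For example, with two atoms, candidates `[["a", "b"], ["c"]]` and probabilities 0.6, 0.5 and 0.4, both hyperedges `{a, c}` and `{b, c}` satisfy the condition.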

From the hypothesis, it holds that, for every hyperedge $e$ of $HG(D^p, IC)$, it must be the case that $\sum_{t \in e} p(t) \le |e| - 1$.
Let $e^*$ be the hyperedge in $HG(D^p, IC)$ such that $|e| - 1 - \sum_{t \in e} p(t)$ is the minimum, that is,

$e^* = \arg\min_{e \in HG(D^p, IC)} \left( |e| - 1 - \sum_{t \in e} p(t) \right)$.

For the sake of simplicity of presentation we consider the case that $e^*$ has size $m$, and denote its tuples as $t_1, \dots, t_m$.
The generalization to the case that the size of $e^*$ is less than $m$ is straightforward.

Let $S$ be a subset of $D^p$. We denote with $D^p_S$ the subset of $D^p$ containing only the tuples in $S$. Let $Pr_{e^*}$ be a model
in $\mathcal{M}(D^p_{e^*}, IC)$. Moreover, let $t'_1, \dots, t'_n$ be the tuples in $D^p \setminus e^*$.

In the following, we will define a sequence of interpretations $Pr_0, Pr_1, \dots, Pr_n$ such that, for each $i \in [0..n]$, $Pr_i$ is
a model in $\mathcal{M}(D^p_{e^* \cup \{t'_j \mid j \le i\}}, IC)$.

We start by taking $Pr_0$ equal to $Pr_{e^*}$. At the $i$-th step we consider tuple $t'_i$ and define $Pr_i$ as follows:

1. In the case that, for each $j \in [1..m]$, it holds that $t'_i \notin R_j^{\varphi_j}$, we define, for each possible world $w$ in $pwd(e^* \cup \{t'_j \mid j \le
i\})$, $Pr_i(w) = Pr_{i-1}(w \setminus \{t'_i\}) \cdot p(t'_i)$, if $t'_i \in w$, and $Pr_i(w) = Pr_{i-1}(w \setminus \{t'_i\}) \cdot (1 - p(t'_i))$, otherwise.
2. Otherwise, if there is $j \in [1..m]$ such that $t'_i \in R_j^{\varphi_j}$, we consider the set $J$ of all the indexes $j \in [1..m]$ such that
$t'_i \in R_j^{\varphi_j}$. Moreover, we denote with $p_J$ the sum of the probabilities (computed according to $Pr_{i-1}$) of all the
possible worlds $w \in pwd(e^* \cup \{t'_j \mid j \le i - 1\})$ such that, for each $j \in J$, the corresponding tuple $t_j$ appearing in $e^*$
belongs also to $w$, i.e., $p_J = \sum_{w \in pwd(e^* \cup \{t'_j \mid j \le i-1\}) \text{ s.t. } \forall j \in J,\ t_j \in w} Pr_{i-1}(w)$.

Then, for each possible world $w$ in $pwd(e^* \cup \{t'_j \mid j \le i\})$, we define $Pr_i$ as follows:

• $Pr_i(w) = Pr_{i-1}(w - \{t'_i\}) \cdot \frac{p(t'_i)}{p_J}$, if $t'_i \in w$ and for each $j \in J$ it holds that $t_j \in w$,

• $Pr_i(w) = Pr_{i-1}(w - \{t'_i\}) \cdot \frac{p_J - p(t'_i)}{p_J}$, if $t'_i \notin w$ and for each $j \in J$ it holds that $t_j \in w$,

• $Pr_i(w) = Pr_{i-1}(w)$, if $t'_i \notin w$ and there is a $j \in J$ such that $t_j \notin w$,

• $Pr_i(w) = 0$, otherwise.
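One step of this construction can be sketched in Python as follows. The sketch assumes an interpretation is stored as a dictionary from possible worlds (frozensets of tuple identifiers) to probabilities, and that the second case multiplies by $p(t'_i)/p_J$ and $(p_J - p(t'_i))/p_J$, which is our reading of the definition. The function name and argument layout are hypothetical.

```python
def extend_interpretation(prev, new_tuple, p_new, replaced):
    """prev: dict mapping frozenset (possible world) -> probability, i.e. Pr_{i-1}.
    new_tuple: identifier of t'_i; p_new: its marginal probability.
    replaced: set of the tuples t_j of e* with j in J (empty in Case 1).
    Returns Pr_i as a new dict; worlds not listed have probability 0."""
    replaced = frozenset(replaced)
    out = {}
    if not replaced:  # Case 1: the new tuple is independent of the others
        for w, pr in prev.items():
            out[w | {new_tuple}] = pr * p_new
            out[w] = pr * (1.0 - p_new)
        return out
    # Case 2: the new tuple may only appear in worlds containing all t_j, j in J
    p_J = sum(pr for w, pr in prev.items() if replaced <= w)
    if p_J < p_new:
        raise ValueError("p_J must be at least p(t'_i)")
    for w, pr in prev.items():
        if replaced <= w:
            out[w | {new_tuple}] = pr * p_new / p_J
            out[w] = pr * (p_J - p_new) / p_J
        else:
            out[w] = pr  # worlds missing some t_j keep their probability
    return out
```

In both cases the marginal probability of the new tuple equals `p_new`, and the marginals of the tuples already placed are unchanged.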

We prove that for each $i \in [0..n]$ it holds that $Pr_i$ is a model in $\mathcal{M}(D^p_{e^* \cup \{t'_j \mid j \le i\}}, IC)$, reasoning by induction on $i$.
The proof is straightforward for $i = 0$. We now prove the induction step, that is, we assume that $Pr_{i-1}$ is a model in
$\mathcal{M}(D^p_{e^* \cup \{t'_j \mid j \le i-1\}}, IC)$ and prove that $Pr_i$ is a model in $\mathcal{M}(D^p_{e^* \cup \{t'_j \mid j \le i\}}, IC)$.

As regards the first case of the definition of $Pr_i$ from $Pr_{i-1}$, it is easy to see that $Pr_i$ is a model in $\mathcal{M}(D^p_{e^* \cup \{t'_j \mid j \le i\}}, IC)$,
since $Pr_i$ consists of a trivial extension of $Pr_{i-1}$ which takes into account a tuple not correlated with the other tuples
in the database.

As regards the second case of the definition of $Pr_i$ from $Pr_{i-1}$, it is easy to see that, if $p_J \ge p(t'_i)$, then $Pr_i$ guarantees
that the condition about the marginal probabilities of all the tuples in $e^* \cup \{t'_j \mid j \le i\}$ holds. Moreover, $Pr_i$ assigns zero
probability to each possible world $w$ such that $w \not\models IC$, since, for each possible world $w$ in $pwd(e^* \cup \{t'_j \mid j \le i\})$ that is
assigned a non-zero probability, there is no subset $S$ of $w$ such that for each $x \in [1..m]$ there is a tuple $t \in S$ such that $t \in R_x^{\varphi_x}$. The latter follows from the
induction hypothesis, which ensures that $Pr_{i-1}$ is a model in $\mathcal{M}(D^p_{e^* \cup \{t'_j \mid j \le i-1\}}, IC)$, and from the fact that $Pr_i$ assigns
non-zero probability to a possible world $w$ in $pwd(e^* \cup \{t'_j \mid j \le i\})$ containing $t'_i$ only if for each $j \in J$ it holds that $t_j \in w$.
Specifically, it cannot be the case that $w$ contains, for each $x \in [1..m]$ such that $x \notin J$, a tuple $t_x \in R_x^{\varphi_x}$, as otherwise
$w - \{t'_i\}$ would satisfy all the conditions expressed in $ic$, and $w - \{t'_i\}$ would be assigned a non-zero probability by $Pr_{i-1}$,
thus contradicting the induction hypothesis that $Pr_{i-1}$ is a model in $\mathcal{M}(D^p_{e^* \cup \{t'_j \mid j \le i-1\}}, IC)$.

We now prove that $p_J \ge p(t'_i)$. Reasoning by contradiction, assume that $p_J < p(t'_i)$. From the definition of $p_J$
it follows that $p_J \ge p^{min}(\bigwedge_{j \in J} t_j)$. Therefore, since $p^{min}(\bigwedge_{j \in J} t_j)$ is equal to $\max\{0, \sum_{j \in J} p(t_j) - |J| + 1\}$, it follows that
$p(t'_i) > \sum_{j \in J} p(t_j) - |J| + 1$. Consider the hyperedge $e = \{t_x \mid t_x \in e^* \wedge x \notin J\} \cup \{t'_i\}$. From the definition of $e^*$ it follows
that $|e| - 1 - \sum_{t \in e} p(t) \ge |e^*| - 1 - \sum_{t \in e^*} p(t)$. The latter implies that $1 - p(t'_i) \ge |J| - \sum_{j \in J} p(t_j)$, from which it follows
that $\sum_{j \in J} p(t_j) - |J| + 1 \ge p(t'_i)$, which is a contradiction. Hence, we can conclude that, in this case, $Pr_i$ is a model in
$\mathcal{M}(D^p_{e^* \cup \{t'_j \mid j \le i\}}, IC)$.

This concludes the proof, as $Pr_n$ is a model in $\mathcal{M}(D^p, IC)$ and then $D^p \models IC$. □

Theorem 5. There is an $IC$ consisting of a non-join-free denial constraint of arity 3 such that cc is NP-hard.

Proof. The reader is kindly requested to read this proof after that of Theorem 7, as the construction used there will be
exploited in the reasoning used below.

We show that the reduction from 3-coloring to cc presented in the hardness proof of Theorem 7 can be rewritten
to obtain an instance of cc where $IC$ contains only a denial constraint having arity equal to 3.

Let $G = (N, E)$ be a 3-coloring instance. We construct an equivalent instance $\langle \mathcal{D}^p, IC, D^p \rangle$ of cc as follows:

- $\mathcal{D}^p$ consists of the probabilistic relation schemas $R_1^p(Node, Color, P)$ and $R_2^p(Node1, Node2, Color1, Color2, P)$;

- $D^p$ is the instance of $\mathcal{D}^p$ consisting of the instances $r_1^p$ of $R_1^p$, and $r_2^p$ of $R_2^p$, defined as follows:

• for each node $n \in N$, and for each color $c \in \{Red, Green, Blue\}$, $r_1^p$ contains the tuple $(n, c, 1/3)$;

• for each edge $(n_1, n_2) \in E$, and for each color $c \in \{Red, Green, Blue\}$, $r_2^p$ contains the tuple $(n_1, n_2, c, c, 1)$;
moreover, for each node $n \in N$, and for each pair of distinct colors $c_1, c_2 \in \{Red, Green, Blue\}$, $r_2^p$ contains the
tuple $(n, n, c_1, c_2, 1)$;

- $IC$ is the set of denial constraints over $\mathcal{D}^p$ consisting of the constraint: $\neg[R_1(x_1, x_2) \wedge R_1(x_3, x_4) \wedge R_2(x_1, x_3, x_2, x_4)]$.

Basically, the constraint in $IC$ imposes that adjacent nodes cannot be assigned the same color, and the same node
cannot be assigned more than one color.
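To make the reduction concrete, the following Python sketch builds the two relation instances from a graph. It assumes that each tuple of the first relation has probability 1/3 (one of three equally likely colors per node); the function name `build_cc_instance` and the list-of-tuples representation are ours.

```python
from fractions import Fraction
from itertools import permutations

COLORS = ("Red", "Green", "Blue")

def build_cc_instance(nodes, edges):
    """Return (r1, r2): r1 holds (node, color, prob) tuples, and
    r2 holds (node1, node2, color1, color2, prob) tuples."""
    # Assumption: every node/color pair has probability 1/3.
    r1 = [(n, c, Fraction(1, 3)) for n in nodes for c in COLORS]
    # Adjacent nodes must not share a color.
    r2 = [(n1, n2, c, c, Fraction(1)) for (n1, n2) in edges for c in COLORS]
    # A node must not receive two distinct colors.
    r2 += [(n, n, c1, c2, Fraction(1))
           for n in nodes for c1, c2 in permutations(COLORS, 2)]
    return r1, r2
```

The size of the produced instance is linear in the size of the graph: $3|N|$ tuples in the first relation and $3|E| + 6|N|$ tuples in the second one.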

Let $\langle \mathcal{D}', IC', D' \rangle$ be the instance of cc defined in the hardness proof of Theorem 7, where it was shown that an
instance $G$ of 3-coloring is 3-colorable iff $D' \models IC'$. It is easy to see that $D^p \models IC$ iff $D' \models IC'$, which completes
the proof. □



Theorem 6. If $IC$ consists of a BEGD, then cc is in PTIME.

Proof. Let the BEGD in $IC$ be:

$ic = \neg[R_1(\vec{x}, \vec{y}_1) \wedge R_2(\vec{x}, \vec{y}_2) \wedge z_1 \ne z_2]$,

where each $z_i$ (with $i \in \{1, 2\}$) is a variable in $\vec{y}_1$ or $\vec{y}_2$. That is, for the sake of presentation, we assume that the
conjunction of built-in predicates in $ic$ consists of one conjunct only (this yields no loss of generality, as it is easy
to see that the reasoning used in the proof is still valid in the presence of more conjuncts). We consider two cases
separately.

Case 1: $R_1 = R_2$, that is, only one relation name occurs in $ic$. Let $\mathbf{X}$ be the set of attributes in $Attr(R_1)$ corresponding to
the variables in $\vec{x}$, and let $Z_1$ and $Z_2$ be the attributes in $Attr(R_1)$ corresponding to the variables $z_1$ and $z_2$, respectively.
Let $r$ be an instance of $R_1$.

It is easy to see that the conflict hypergraph $HG(r, IC)$ is a graph having the following structure: for any pair of
tuples $t_1, t_2$, there is the edge $(t_1, t_2)$ in $HG(r, IC)$ iff: 1) $\forall X \in \mathbf{X}$, $t_1[X] = t_2[X]$, and 2) $t_1[Z_1] \ne t_2[Z_2]$.

This structure of the conflict hypergraph implies a partition of the tuples of $r$, where the tuples in each set of
the partition share the same values of the attributes in $\mathbf{X}$. Obviously, cc can be decided by considering these sets
separately.

For each set $G$ of this partition, we reason as follows. Let $\mathcal{V}_G$ be the set of pairs of values $(c_1, c_2)$ occurring as
values of attributes $Z_1$ and $Z_2$ in at least one tuple of $G$ (that is, $\mathcal{V}_G$ is the projection of $G$ over $Z_1$ and $Z_2$). For each
pair $(c_1, c_2) \in \mathcal{V}_G$, let $T[c_1, c_2]$ be the set of tuples in $G$ such that, $\forall t \in T[c_1, c_2]$, $t[Z_1] = c_1$ and $t[Z_2] = c_2$. A first
necessary condition for consistency is that there is no pair $(c_1, c_2) \in \mathcal{V}_G$ such that $c_1 \ne c_2$: otherwise, any tuple in
$T[c_1, c_2]$ would not satisfy the constraint, thus it would not be possible to put it in any possible world with non-zero
probability. Straightforwardly, this condition is also sufficient if $z_1$ and $z_2$ belong to the same relation atom. Thus, in
this case, the proof ends, as checking this condition can be done in polynomial time.

Otherwise, if $z_1$ and $z_2$ belong to different relation atoms and if the above-introduced necessary condition holds,
we proceed as follows. From what was said above, it must be the case that $\mathcal{V}_G$ contains only pairs of the form $(c, c)$,
and, correspondingly, all the sets $T[c_1, c_2]$ are of the form $T[c, c]$. For each $T[c, c]$, let $\tilde{p}(T[c, c])$ be the maximum
probability of the tuples in $T[c, c]$ (i.e., $\tilde{p}(T[c, c]) = \max_{t \in T[c, c]}\{p(t)\}$). Moreover, for each $(c, c) \in \mathcal{V}_G$, take the tuple
$t_c$ in $G$ such that $p(t_c) = \tilde{p}(T[c, c])$, and let $\mathcal{T}_G$ be the set of these tuples. We show that cc is true iff, for each $G$, the
following inequality (which can be checked in polynomial time) holds:

$\sum_{(c, c) \in \mathcal{V}_G} \tilde{p}(T[c, c]) \le 1$ (A.6)

($\Rightarrow$): Reasoning by contradiction, assume that, for a group $G$, inequality (A.6) does not hold, but there is a model
for the PDB w.r.t. $IC$.

The constraint entails that, for each pair of distinct tuples $t_1, t_2 \in \mathcal{T}_G$, there is the edge $(t_1, t_2)$ in $HG(r, IC)$. Hence,
there is a clique in $HG(r, IC)$ consisting of the tuples in $\mathcal{T}_G$. Since the sum of the probabilities of the tuples in $\mathcal{T}_G$ is
greater than 1 (by contradiction hypothesis), and since cc is true only if, for each clique in the conflict hypergraph, the
sum of the probabilities in the clique does not exceed 1, it follows that cc is false.

($\Leftarrow$): It is straightforward to see that there is a model for $\mathcal{T}_G$ w.r.t. $ic$, since the sum of the probabilities of the tuples
in $\mathcal{T}_G$ is less than or equal to 1, and since the tuples in $\mathcal{T}_G$ describe a clique in $HG(\mathcal{T}_G, ic)$. Since, for each $(c, c) \in \mathcal{V}_G$,
the tuple $t_c$ in $\mathcal{T}_G$ is such that its probability is not less than the probability of every other tuple in $T[c, c]$, it is easy to
see that a model $M$ for $G$ w.r.t. $ic$ can be obtained by putting the tuples in $T[c, c]$ other than $t_c$ in the portion of the
probability space occupied by the worlds containing $t_c$.

2 Obviously, we assume that there is no tuple with zero probability, as tuples with zero probability can be discarded from the database instance.
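The polynomial-time test of Case 1 (for $z_1$ and $z_2$ in different relation atoms) can be sketched in Python as follows. The sketch assumes each tuple is given as `(x, z1, z2, p)`, where `x` is the (hashable) combination of values of the attributes in $\mathbf{X}$, `z1` and `z2` are the values of $Z_1$ and $Z_2$, and `p > 0` is the probability. The function name is illustrative.

```python
from collections import defaultdict

def begd_single_relation_consistent(tuples, tol=1e-12):
    """Check the necessary condition (no tuple with z1 != z2) and then
    inequality (A.6) for every group of tuples sharing the same x."""
    best = defaultdict(dict)  # x -> {c: maximum probability over T[c, c]}
    for x, z1, z2, p in tuples:
        if z1 != z2:
            return False  # such a tuple violates the constraint by itself
        best[x][z1] = max(best[x].get(z1, 0.0), p)
    # (A.6): within each group, the per-value maxima must sum to at most 1.
    return all(sum(group.values()) <= 1.0 + tol for group in best.values())
```

When $z_1$ and $z_2$ belong to the same relation atom, only the first check (no tuple with `z1 != z2`) is needed.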

Case 2: $R_1 \ne R_2$. We assume that $z_1 \in \vec{y}_1$ and $z_2 \in \vec{y}_2$, that is, two distinct relation names occur in $ic$, and the variables
of the inequality predicate belong to different relation atoms. In fact, the case that $z_1$ and $z_2$ belong to the same
relation atom can be proved by reasoning analogously.

Let $\mathbf{X}_1$ and $\mathbf{X}_2$ be, respectively, the sets of attributes in $Attr(R_1)$ and $Attr(R_2)$ corresponding to the variables in $\vec{x}$,
and let $Z_1$ and $Z_2$ be the attributes in $Attr(R_1)$ and $Attr(R_2)$ corresponding to the variables $z_1$ and $z_2$, respectively. Let
$r_1$ be the instance of $R_1$, and $r_2$ be the instance of $R_2$.

Observe that $ic$ does not impose any condition between pairs of tuples $t_1 \in r_1$ and $t_2 \in r_2$ such that there are
attributes $X_1 \in \mathbf{X}_1$ and $X_2 \in \mathbf{X}_2$ such that $t_1[X_1] \ne t_2[X_2]$. This entails that cc can be decided by considering the
consistency of the tuples of $r_1$ and $r_2$ sharing the same combination of values for the attributes corresponding to the
variables in $\vec{x}$ separately from the tuples sharing different combinations of values for the same attributes. For each
combination $\vec{v} = v_1, \dots, v_k$ of values for these attributes (i.e., $\forall \vec{v} \in \Pi_{\mathbf{X}_1}(r_1) \cap \Pi_{\mathbf{X}_2}(r_2)$), let $G_1(\vec{v})$ and $G_2(\vec{v})$ be the sets
of tuples of $r_1$ and $r_2$, respectively, where the attributes corresponding to the variables in $\vec{x}$ have values $v_1, \dots, v_k$. Let
$\mathcal{V}(G_1(\vec{v})) = \{t[Z_1] \mid t \in G_1(\vec{v})\}$ and $\mathcal{V}(G_2(\vec{v})) = \{t[Z_2] \mid t \in G_2(\vec{v})\}$. For each $c_1 \in \mathcal{V}(G_1(\vec{v}))$ (resp., $c_2 \in \mathcal{V}(G_2(\vec{v}))$), let
$T_1[c_1]$ (resp., $T_2[c_2]$) be the set of tuples $t$ of $G_1(\vec{v})$ (resp., $G_2(\vec{v})$) such that $t[Z_1] = c_1$ (resp., $t[Z_2] = c_2$). Moreover, for
each $c_1 \in \mathcal{V}(G_1(\vec{v}))$ (resp., $c_2 \in \mathcal{V}(G_2(\vec{v}))$), let $\tilde{p}(T_1[c_1])$ (resp., $\tilde{p}(T_2[c_2])$) be the maximum probability of the tuples
in $T_1[c_1]$ (resp., $T_2[c_2]$).

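We show that cc is true iff, $\forall \vec{v} \in \Pi_{\mathbf{X}_1}(r_1) \cap \Pi_{\mathbf{X}_2}(r_2)$, it is the case that:

$\forall c_1 \in \mathcal{V}(G_1(\vec{v}))\ \forall c_2 \in \mathcal{V}(G_2(\vec{v}))$ s.t. $c_1 \ne c_2$, it holds that $\tilde{p}(T_1[c_1]) + \tilde{p}(T_2[c_2]) \le 1$ (A.7)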
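Condition (A.7) can be tested with a few nested loops. In the Python sketch below, each tuple of either relation is assumed to be given as `(x, z, p)`, where `x` is the combination of values for the attributes corresponding to $\vec{x}$, `z` is the value of $Z_1$ (resp., $Z_2$), and `p` is the probability. This representation and the function name are our own.

```python
from collections import defaultdict

def begd_two_relations_consistent(r1, r2, tol=1e-12):
    """Check (A.7): for every shared combination x and every pair of distinct
    values c1 != c2, the maximum probabilities of T1[c1] and T2[c2] must sum
    to at most 1."""
    best1, best2 = defaultdict(dict), defaultdict(dict)
    for best, rel in ((best1, r1), (best2, r2)):
        for x, z, p in rel:
            best[x][z] = max(best[x].get(z, 0.0), p)
    for x in best1.keys() & best2.keys():  # only shared combinations matter
        for c1, p1 in best1[x].items():
            for c2, p2 in best2[x].items():
                if c1 != c2 and p1 + p2 > 1.0 + tol:
                    return False
    return True
```

The running time is polynomial in the number of tuples, since at most one pair of maxima is compared for each pair of distinct values within a group.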

($\Rightarrow$): Reasoning by contradiction, assume that the database is consistent but there are $c_1 \in \mathcal{V}(G_1(\vec{v}))$ and $c_2 \in
\mathcal{V}(G_2(\vec{v}))$, with $c_1 \ne c_2$, such that $\tilde{p}(T_1[c_1]) + \tilde{p}(T_2[c_2]) > 1$. Hence, there are tuples $t_1 \in T_1[c_1]$ and $t_2 \in T_2[c_2]$ such
that $p(t_1) + p(t_2) > 1$. As these tuples form a conflicting set, the conflict hypergraph $HG(D^p, IC)$ contains the edge
$(t_1, t_2)$. It follows that the condition of Theorem 2, that is a necessary condition for the consistency in the presence
of any hypergraph (as pointed out in the core of the paper after Theorem 2), is not satisfied, thus contradicting the
hypothesis.

($\Leftarrow$): It suffices to separately consider each $\vec{v} \in \Pi_{\mathbf{X}_1}(r_1) \cap \Pi_{\mathbf{X}_2}(r_2)$, and to show that the fact that (A.7) holds for
this $\vec{v}$ implies the consistency of the tuples in $G_1(\vec{v}) \cup G_2(\vec{v})$ (as explained above, the consistency can be checked by
separately considering the various combinations in $\Pi_{\mathbf{X}_1}(r_1) \cap \Pi_{\mathbf{X}_2}(r_2)$).

Let $t_1 \in G_1(\vec{v})$ and $t_2 \in G_2(\vec{v})$ be such that

(i) $t_1 \in T_1[c_1]$ and $t_2 \in T_2[c_2]$, with $c_1 \ne c_2$; and

(ii) among the pairs of tuples satisfying the above condition, $t_1$ and $t_2$ have maximum probability w.r.t. the tuples
in $G_1(\vec{v})$ and $G_2(\vec{v})$, respectively.

If these two tuples do not exist, then the set of tuples $G_1(\vec{v}) \cup G_2(\vec{v})$ is consistent, as there are no tuples
coinciding in the values of the attributes corresponding to $\vec{x}$ but not in the attributes corresponding to $z_1$ and $z_2$. It
remains to be proved that, if these two tuples exist, then the tuples in $G_1(\vec{v}) \cup G_2(\vec{v})$ are consistent w.r.t. IC. In fact,
equation (A.7) ensures that $p(t_1) + p(t_2) \le 1$, which in turn entails that a model for $\{t_1, t_2\}$ w.r.t. IC exists. Starting
from this model, a model $M$ for $G_1(\vec{v}) \cup G_2(\vec{v})$ w.r.t. IC can be obtained as follows. The tuples in $G_1(\vec{v})$ other than
$t_1$ which conflict with at least one tuple in $G_2(\vec{v})$ are put in the portion of the probability space occupied by the
worlds containing $t_1$. This can be done because $t_1$ has maximum probability among the tuples in $G_1(\vec{v})$
conflicting with at least one tuple in $G_2(\vec{v})$, so any other such tuple in $G_1(\vec{v})$
has a probability which fits the portion of the probability space occupied by $t_1$. Similarly, the tuples in $G_2(\vec{v})$ other
than $t_2$ which conflict with at least one tuple in $G_1(\vec{v})$ are put in the portion of the probability space occupied by
the worlds containing $t_2$. Also in this case, this can be done since $t_2$ has maximum probability among the tuples in
$G_2(\vec{v})$ conflicting with at least one tuple in $G_1(\vec{v})$. Finally, any tuple in $G_1(\vec{v})$ (resp., $G_2(\vec{v})$) which conflicts with
no tuple in $G_2(\vec{v})$ (resp., $G_1(\vec{v})$) can be put in any portion of the probability space, since its co-occurrence with any
other tuple violates no constraint. □
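
Condition (A.7) lends itself to a direct check by grouping tuples. The following Python sketch is only an illustration under stated assumptions: it assumes a single binary constraint of the form $\neg[R_1(\vec{x}, \vec{y}_1) \wedge R_2(\vec{x}, \vec{y}_2) \wedge z_1 \neq z_2]$, and the tuple encoding `(x, z, p)` as well as the name `satisfies_a7` are hypothetical choices made for this example.

```python
from collections import defaultdict
from fractions import Fraction


def satisfies_a7(r1, r2):
    """Check condition (A.7) for two relation instances.

    Each relation is an iterable of (x, z, p) triples (an assumed encoding):
    x is the tuple of values for the shared variables, z is the value of the
    compared attribute (Z1 or Z2), and p is the marginal probability.
    """
    def max_prob_by_group(rel):
        # groups[x][c] = p(T[c]), i.e. the maximum probability among the
        # tuples of G(x) whose compared attribute has value c.
        groups = defaultdict(dict)
        for x, z, p in rel:
            groups[x][z] = max(groups[x].get(z, 0), p)
        return groups

    g1, g2 = max_prob_by_group(r1), max_prob_by_group(r2)
    # Only combinations of x-values occurring in both relations matter.
    for x in g1.keys() & g2.keys():
        for c1, p1 in g1[x].items():
            for c2, p2 in g2[x].items():
                if c1 != c2 and p1 + p2 > 1:
                    return False
    return True
```

For instance, two tuples sharing the same $\vec{x}$-values with different $Z$-values pass the check when their probabilities are $1/2$ and $1/2$, and fail it when they are $1/2$ and $3/4$.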
Theorem 7. There is an IC consisting of 2 FDs over the same relation scheme such that cc is NP-hard.

Proof. We show a LOGSPACE reduction from 3-coloring to cc which yields cc instances where IC contains only
functional dependencies. The rationale of the proof is similar to the proof in [22] of the NP-hardness of PSAT.

We briefly recall the definition of 3-coloring. An instance of 3-coloring consists of a graph $G = (N, E)$, where $N$
is a set of node identifiers and $E$ is a set of edges (pairs of node identifiers). The answer of a 3-coloring instance is
true iff there is a total function $f : N \to \{Red, Green, Blue\}$ such that $f(n_i) \neq f(n_j)$ whenever $\{n_i, n_j\} \in E$ ($f$ is said to
be a 3-coloring function over $G$).

Let $G = (N, E)$ be a 3-coloring instance. We construct an equivalent instance $(\mathcal{D}^p, IC, D^p)$ of cc as follows:

- $\mathcal{D}^p$ consists of the probabilistic relation schema $R^p(Node, Color, IdEdge, P)$;

- $D^p$ is the instance of $\mathcal{D}^p$ consisting of the instance $r^p$ of $R^p$ defined as follows: for each node $n \in N$, for each edge
$e \in E$ such that $n \in e$, and for each color $c \in \{Red, Green, Blue\}$, $r^p$ contains the tuple $(n, c, e, \frac{1}{3})$.

- IC is the set of denial constraints over $\mathcal{D}^p$ consisting of the following two functional dependencies:

$ic_1 : \neg[R(x_1, x_2, x_3) \wedge R(x_1, x_4, x_5) \wedge x_2 \neq x_4]$
$ic_2 : \neg[R(x_1, x_2, x_3) \wedge R(x_4, x_2, x_3) \wedge x_1 \neq x_4]$

We first show that, if $G$ is 3-colorable, then $D^p \models IC$. In fact, given a 3-coloring function $f$ over $G$, the interpretation
$Pr$ defined below is a model of $D^p$ w.r.t. IC. $Pr$ assigns non-zero probability to the following three possible
worlds only:

$w_1 = \{R(n, f(n), e) \mid n \in N, e \in E \wedge n \in e\}$;

$w_2 = \{R(n, Next(f(n)), e) \mid n \in N, e \in E \wedge n \in e\}$;

$w_3 = \{R(n, Next(Next(f(n))), e) \mid n \in N, e \in E \wedge n \in e\}$,

where $Next$ is a function which receives a color $c \in \{Red, Green, Blue\}$ and returns the next color in the sequence Red,
Green, Blue (where $Next(Blue)$ returns Red). Specifically, $Pr$ assigns probability $\frac{1}{3}$ to each of the three possible worlds
$w_1, w_2, w_3$. It is easy to see that each possible world $w_1, w_2, w_3$ satisfies IC and that every tuple in $D^p$ appears in exactly
one possible world in $\{w_1, w_2, w_3\}$. Therefore $Pr$ is a model of $D^p$.

We now show that, if $D^p \models IC$, then $G$ is 3-colorable. It is easy to see that $G$ is 3-colorable if there is a model $Pr$
for $D^p$ w.r.t. IC having the following property $\Pi$: $Pr$ assigns non-zero probability only to 3-coloring possible worlds,
i.e., possible worlds containing, for each edge $e = (n_i, n_j) \in E$, two tuples $t_i = R(n_i, c_i, e)$ and $t_j = R(n_j, c_j, e)$, where
$c_i \neq c_j$. In fact, starting from $Pr$ and a 3-coloring possible world $w$ with $Pr(w) > 0$, a function $f^w$ can be defined which
assigns to each node $n \in N$ the color $c$ if there is a tuple $R(n, c, e) \in w$ ($f^w$ is a well-defined function, as $w$ cannot
contain tuples assigning different colors to the same node). Clearly, $f^w$ is a 3-coloring function, as it associates every
node $n$ with a unique color and assigns different colors to pairs of nodes connected by an edge. Hence, it remains
to be shown that at least one model satisfying $\Pi$ exists. In fact, we prove that any model for $D^p$ w.r.t. IC satisfies
$\Pi$. Reasoning by contradiction, assume that, for a model $Pr$, there is a non-3-coloring possible world $w^*$ such that
$Pr(w^*) = \epsilon > 0$. That is, there is at least a pair $n, e$, with $n \in N$ and $e \in E$, such that for each $c \in \{Red, Green, Blue\}$,
$R(n, c, e) \notin w^*$. Now, consider the tuples $t_1 = R(n, Red, e)$, $t_2 = R(n, Green, e)$, $t_3 = R(n, Blue, e)$ and the sets
$S_1 = \{w \in pwd(D^p) \mid t_1 \in w \wedge Pr(w) > 0\}$,
$S_2 = \{w \in pwd(D^p) \mid t_2 \in w \wedge Pr(w) > 0\}$,
$S_3 = \{w \in pwd(D^p) \mid t_3 \in w \wedge Pr(w) > 0\}$.

Since $ic_1$ is satisfied by every possible world $w \in pwd(D^p)$ such that $Pr(w) > 0$, for each such possible
world $w$ there is at most one color $c \in \{Red, Green, Blue\}$ such that the tuple $R(n, c, e)$ belongs to $w$. Therefore, it must
be the case that, $\forall i, j \in \{1, 2, 3\}$ with $i \neq j$, $S_i \cap S_j = \emptyset$. Since $Pr$ is an interpretation, the following equalities must hold:

• $\frac{1}{3} = p(t_1) = \sum_{w \in S_1} Pr(w)$;

• $\frac{1}{3} = p(t_2) = \sum_{w \in S_2} Pr(w)$;

• $\frac{1}{3} = p(t_3) = \sum_{w \in S_3} Pr(w)$.

This implies that

$$\sum_{w \in S_1} Pr(w) + \sum_{w \in S_2} Pr(w) + \sum_{w \in S_3} Pr(w) = 1.$$

However, since $Pr(w^*) = \epsilon > 0$ and $Pr$ is an interpretation, $\sum_{w \in pwd(D^p) \setminus \{w^*\}} Pr(w) < 1$. The latter, since $w^* \notin S_i$
for each $i \in \{1, 2, 3\}$, implies that $pwd(D^p) \setminus \{w^*\} \supseteq S_1 \cup S_2 \cup S_3$, and then $\sum_{w \in S_1 \cup S_2 \cup S_3} Pr(w) < 1$, which is a
contradiction. □
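
The construction used in the first direction of the proof can be replayed on small graphs. The Python sketch below is a hedged illustration, not part of the original proof: the helper names (`build_instance`, `worlds_from_coloring`, `satisfies_ic`, `is_model`) and the encoding of edges as `frozenset` pairs are assumptions made here. It builds the instance of the reduction, derives the three worlds $w_1, w_2, w_3$ from a coloring, and checks that giving each world probability $1/3$ matches all marginals while satisfying $ic_1$ and $ic_2$.

```python
from fractions import Fraction
from itertools import combinations

COLORS = ("Red", "Green", "Blue")


def build_instance(edges):
    """Map every tuple (n, c, e) of the reduction to its marginal 1/3."""
    third = Fraction(1, 3)
    return {(n, c, e): third for e in edges for n in e for c in COLORS}


def next_color(c):
    """Next color in the cyclic sequence Red, Green, Blue."""
    return COLORS[(COLORS.index(c) + 1) % 3]


def worlds_from_coloring(edges, f):
    """Build w1, w2, w3 by applying Next zero, one and two times to f."""
    worlds, shifted = [], dict(f)
    for _ in range(3):
        worlds.append({(n, shifted[n], e) for e in edges for n in e})
        shifted = {n: next_color(c) for n, c in shifted.items()}
    return worlds


def satisfies_ic(world):
    """Check ic1 (one color per node) and ic2 (distinct colors on an edge)."""
    for (n1, c1, e1), (n2, c2, e2) in combinations(world, 2):
        if n1 == n2 and c1 != c2:
            return False  # violates ic1
        if c1 == c2 and e1 == e2 and n1 != n2:
            return False  # violates ic2
    return True


def is_model(instance, worlds):
    """Uniform distribution over `worlds` must match marginals and satisfy IC."""
    weight = Fraction(1, len(worlds))
    marginals_ok = all(
        sum(weight for w in worlds if t in w) == p for t, p in instance.items()
    )
    return marginals_ok and all(satisfies_ic(w) for w in worlds)
```

On a triangle with a proper coloring the check succeeds, whereas worlds derived from an improper coloring violate $ic_2$.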

Theorem 8. Let each denial constraint in IC be join-free or a BEGD. If, for each pair of distinct constraints $ic_1, ic_2$
in IC, the relation names occurring in $ic_1$ are distinct from those in $ic_2$, then cc is in PTIME.

Proof. Trivially follows from Theorem 6 and the corresponding tractability result for join-free constraints, and from the
fact that the consistency can be checked by considering the maximal connected components of the conflict hypergraph separately. □

Theorem 9. If IC consists of one FD per relation, then $HG(D^p, IC)$ is a graph where each connected component is
either a singleton or a complete multipartite graph. Moreover, $D^p$ is consistent w.r.t. IC iff the following property
holds: for each connected component $C$ of $HG(D^p, IC)$, denoting the maximal independent sets of $C$ as $S_1, \dots, S_k$, it
is the case that $\sum_{i \in [1..k]} p_i \le 1$, where $p_i = \max_{t \in S_i} p(t)$.

Proof. It is easy to see that multiple FDs over distinct relations involve disjoint sets of tuples. Thus, it is straightforward
to see that the conflict hypergraph has the structural property described in the statement iff, for each relation,
the conflict hypergraph over the set of tuples of this relation is a graph having the same structural property. Moreover,
as observed in the proof of Theorem 8, the consistency can be checked by considering the maximal connected
components of the conflict hypergraph separately.

This implies that, in order to prove the statement, it suffices to consider the case that IC consists of a unique
FD $ic$ over a relation $R$, and $D^p$ consists of an instance $r$ of $R$. In particular, we assume that $ic$ is of the form:

$\neg[R(\vec{x}, \vec{y}_1) \wedge R(\vec{x}, \vec{y}_2) \wedge z_1 \neq z_2]$,

where $z_1$ and $z_2$ are variables in $\vec{y}_1$ and $\vec{y}_2$, respectively, corresponding to the same attribute $Z$ of $R$. That is, we are
assuming that the FD $ic$ is in canonical form (i.e., its right-hand side consists of a unique attribute). This yields no
loss of generality, as it is easy to see that the reasoning used in the proof is still valid in the presence of FDs whose
right-hand sides contain more than one attribute.

The relation instance $r$ can be partitioned into the two relations $r'$, $r''$, containing the tuples connected to at least
another tuple in $HG(D^p, IC)$ (that is, tuples belonging to some conflicting set) and the isolated tuples (that is, tuples
belonging to no conflicting set), respectively. Obviously, the subgraph of $HG(D^p, IC)$ containing only the tuples in
$r''$ contains no edge, and it is such that each of its connected components is a singleton. Therefore, in order to complete
the proof of the first part of the statement, it remains to be proved that the subgraph $G$ of $HG(D^p, IC)$ containing only
the tuples in $r'$ is such that each of its connected components is a complete multipartite graph.

Let $X$ be the set of attributes in $Attr(R)$ corresponding to the variables in $\vec{x}$. The form of $ic$ implies that $G$ is a graph
having the following structural property $\mathcal{S}$: for any pair of tuples $t_1, t_2$, there is the edge $(t_1, t_2)$ in $G$ iff: 1) for every attribute $A \in X$,
$t_1[A] = t_2[A]$, and 2) $t_1[Z] \neq t_2[Z]$.

This implies that $G$ has as many connected components as the cardinality of $\Pi_X r'$. Specifically, each connected
component of $G$ corresponds to a tuple $\vec{v}$ in $\Pi_X r'$, as it contains every tuple of $r'$ whose projection over $X$ coincides
with $\vec{v}$. In fact, property $\mathcal{S}$ implies that:

A. there is no path in G between tuples differing in at least one attribute in X; 

B. any two tuples $t'$, $t''$ coinciding in all the attributes in $X$ are either directly connected to one another (in the
case that they do not coincide in attribute $Z$), or there is a third tuple $t'''$ to which they are both connected. In
fact, $t'$ and $t''$ are not isolated (otherwise they would not belong to $r'$), and any tuple conflicting with $t'$ is also
conflicting with $t''$, as we are in the case that $t'$ and $t''$ coincide in $Z$.

To complete the proof of the first part of the statement, we now show that any connected component $C$ of
$G$ is a complete multipartite graph. This straightforwardly follows from the following facts:

a. the nodes of $C$ can be partitioned into the maximal independent sets $S_1, \dots, S_k$, where $k$ is the number of distinct
values of attribute $Z$ occurring in the tuples in $C$. In particular, each $S_i$ corresponds to one of these values $v$ of $Z$,
and contains all the tuples of $C$ having $v$ as value of attribute $Z$. The fact that every $S_i$ is a maximal independent
set trivially follows from property $\mathcal{S}$.

b. for every pair of tuples $t_i$ and $t_j$ belonging to $S_i$ and $S_j$ (with $i, j \in [1..k]$ and $i \neq j$), there is an edge connecting
$t_i$ to $t_j$ (this also trivially follows from property $\mathcal{S}$).

We now prove the second part of the statement. 

($\Rightarrow$): Reasoning by contradiction, assume that $D^p$ is consistent w.r.t. IC but, for some connected component $C$ of
$HG(D^p, IC)$, it does not hold that $\sum_{i \in [1..k]} p_i \le 1$, where $p_i = \max_{t \in S_i} p(t)$ and $S_1, \dots, S_k$ are the maximal independent
sets of $C$. Obviously, $C$ cannot be a singleton (otherwise the inequality would hold), thus it must be the case that $C$ is
a complete multipartite graph.

For each $i \in [1..k]$, let $t_i$ be a tuple of $S_i$ such that $p(t_i) = p_i$. Since $C$ is a complete multipartite graph, and since
the so obtained tuples $t_1, \dots, t_k$ belong to distinct independent sets, it must be the case that, for each $i, j \in [1..k]$ with
$i \neq j$, there is an edge in $C$ between $t_i$ and $t_j$. This means that, in every model $M$ for $D^p$ w.r.t. IC, for each $i, j \in [1..k]$
with $i \neq j$, the tuples $t_i, t_j$ cannot co-exist in a non-zero probability possible world. That is, every non-zero probability
possible world contains at most one tuple among those in $\{t_1, \dots, t_k\}$. This entails that the sum of the probabilities of
the possible worlds containing the tuples in $\{t_1, \dots, t_k\}$ is equal to the sum of the marginal probabilities of the tuples
in $\{t_1, \dots, t_k\}$, which, by contradiction hypothesis, is greater than 1. This contradicts the fact that $M$ is a model.

($\Leftarrow$): We now show that $D^p$ is consistent w.r.t. IC if the inequality $\sum_{i \in [1..k]} p_i \le 1$ holds, where $p_i = \max_{t \in S_i} p(t)$
and $S_1, \dots, S_k$ are the maximal independent sets of $C$. Consider the database instance $\bar{D}^p$ consisting of the tuples
$t_1, \dots, t_k$, where $t_i$ (with $i \in [1..k]$) is a tuple of $S_i$ such that $p(t_i) = p_i$. It is easy to see that there is a model for $\bar{D}^p$
w.r.t. IC: since $C$ is a complete multipartite graph, and $t_1, \dots, t_k$ belong to distinct independent sets of $C$, it follows
that, for each $i, j \in [1..k]$ with $i \neq j$, there is exactly one edge in $C$ between $t_i$ and $t_j$. That is, the conflict graph of
$\bar{D}^p$ w.r.t. IC is a clique. Hence, the fact that the inequality $\sum_{i \in [1..k]} p_i \le 1$ holds is sufficient to ensure the existence of a
model $\bar{M}$ for $\bar{D}^p$ w.r.t. IC. Starting from $\bar{M}$, a model $M$ for $D^p$ w.r.t. IC can be obtained by reasoning as follows. Since,
for each maximal independent set $S_i$ of $C$ (with $i \in [1..k]$), the tuples in $S_i$ other than $t_i$ have probability
less than or equal to $p(t_i)$, $M$ can be obtained by putting the tuples in $S_i$ other than $t_i$ in the
portion of the probability space occupied by the worlds containing $t_i$ according to the model $\bar{M}$.
The fact that $M$ is a model follows from the fact that, for each $i \in [1..k]$, the tuples in $S_i$ other than $t_i$ conflict
only with the same tuples which conflict with $t_i$. □
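
The characterization in Theorem 9 translates into a simple grouping test. The following Python sketch is an illustration under assumptions, not the authors' procedure: it considers a single relation with one FD $X \to Z$, encodes each tuple as a hypothetical triple `(x, z, p)`, and uses the fact established above that each connected component collects the tuples sharing the same $X$-values, with one maximal independent set per distinct $Z$-value.

```python
from collections import defaultdict
from fractions import Fraction


def consistent_under_fd(tuples):
    """Consistency test for one FD X -> Z over a single relation.

    `tuples` is an iterable of (x, z, p) triples (an assumed encoding),
    where x is the projection on X, z the value of Z, p the marginal.
    """
    # best[x][z] = maximum probability within the independent set of the
    # component identified by x that corresponds to the Z-value z.
    best = defaultdict(dict)
    for x, z, p in tuples:
        best[x][z] = max(best[x].get(z, 0), p)
    # A group with a single Z-value consists of isolated tuples only;
    # otherwise the per-set maxima must sum to at most 1.
    return all(
        len(by_z) < 2 or sum(by_z.values()) <= 1 for by_z in best.values()
    )
```

Grouping and summing take time linear in the number of tuples (up to hashing), which is in line with the tractability claim of the theorem.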

Appendix A.4. Proofs of Lemma 2 and Theorem 11

Lemma 2. Let $Q$ be a conjunctive query over $\mathcal{D}^p$, $D^p$ an instance of $\mathcal{D}^p$, and $t$ an answer of $Q$ having minimum
probability $p^{\min}$ and maximum probability $p^{\max}$. Let $m$ be the number of tuples in $D^p$ plus 3 and $a$ be the maximum
among the numerators and denominators of the probabilities of the tuples in $D^p$. Then $p^{\min}$ and $p^{\max}$ are expressible
as fractions of the form $\frac{\eta}{\delta}$, with $0 \le \eta \le (ma)^m$ and $0 < \delta \le (ma)^m$.

Proof. Consider the equivalent form of the linear programming problem $LP(S^*)$ described in the proof of Proposition 2,
where equalities (e1) of $S^*(\mathcal{D}^p, IC, D^p)$ are rewritten as:

$\forall t \in D^p$, $d(p(t)) \times \sum_{i \mid w_i \in pwd(D^p) \wedge t \in w_i} v_i = d(p(t)) \times p(t)$,

where $p(t) = \frac{n(p(t))}{d(p(t))}$ (i.e., $n(p(t))$ and $d(p(t))$ are the numerator and denominator of $p(t)$, respectively). This way,
we have that all the coefficients of $S^*(\mathcal{D}^p, IC, D^p)$ are integers, where each coefficient can be either 0, or 1, or the
numerator or the denominator of the marginal probability of a tuple of $D^p$.

In [42], it was shown that the solution of any instance of the linear programming problem with integer coefficients
is expressible as a fraction of the form $\frac{\eta}{\delta}$, where both $\eta$ and $\delta$ are naturals bounded by $(ma)^m$, where $m$ is the number
of (in)equalities and $a$ the greatest integer coefficient occurring in the instance. By applying this result to $LP(S^*)$, we
get the statement: in fact, it is easy to see that i) $S^*(\mathcal{D}^p, IC, D^p)$ contains integer coefficients only, ii) the number $m$
of (in)equalities in $S^*(\mathcal{D}^p, IC, D^p)$ is equal to the number of tuples in $D^p$ plus 3, and iii) the greatest integer constant
$a$ in $S^*(\mathcal{D}^p, IC, D^p)$ is the maximum among the numerators and denominators of the probabilities of the tuples in
$D^p$. □
Theorem 11 (Lower bound of mp). There is at least one conjunctive query without projection for which mp is coNP-hard,
even if IC consists of binary constraints only.

Proof. We show a reduction from the planar 3-coloring problem to the complement of the membership problem (mp).
An instance of planar 3-coloring consists of a planar graph $G = (N, E)$, where $N$ is a set of node identifiers and $E$ is
a set of edges (pairs of node identifiers). The answer of a planar 3-coloring instance $G$ is true iff there is a 3-coloring
function over $G$, i.e., a total function $f : N \to \{R, G, B\}$ such that $f(n_i) \neq f(n_j)$ whenever $\{n_i, n_j\} \in E$. Observe
that every planar graph $G = (N, E)$ is 4-colorable, that is, there exists a function $f : N \to \{R, G, B, C\}$ such that
$f(n_i) \neq f(n_j)$ whenever $\{n_i, n_j\} \in E$ (in this case, $f$ is said to be a 4-coloring function).

Let $G = (N, E)$ be a planar 3-coloring instance. We construct an equivalent mp instance $(\mathcal{D}^p, IC, D^p, Q, t, k_1, k_2)$
as follows:

- $\mathcal{D}^p$ consists of the probabilistic relation schemas $R_G^p(Node, Color, IdEdge, P)$ and $R_\phi^p(Tid, P)$.

- $D^p$ is the instance of $\mathcal{D}^p$ consisting of the instances $r_G^p$ of $R_G^p$ and $r_\phi^p$ of $R_\phi^p$ defined as follows:

- for each node n e N and for each edge e e E such that n 6 e, r P c contains four tuples of the form R P c {n, c, e, |), 
one for each c e {R, G, B, C}; 

- r p consists of the tuples R^(l, |) and R P ) (2, 4j only; 

- IC contains the following binary denial constraints:
$ic_1 : \neg[R_G(x_1, x_2, x_3) \wedge R_G(x_1, x_4, x_5) \wedge x_2 \neq x_4]$;

$ic_2 : \neg[R_G(x_1, x_2, x_3) \wedge R_G(x_4, x_2, x_3) \wedge x_1 \neq x_4]$;

$ic_3 : \neg[R_G(x_1, x_2, x_3) \wedge R_\phi(2)]$;
$ic_4 : \neg[R_G(x_1, x_2, C) \wedge R_\phi(1)]$;

- $Q(x, y) = R_\phi(x) \wedge R_\phi(y)$;

- $t = (1, 2)$;

- h = \\ 

- k 2 = 1. 

It is easy to see that the fact that $G$ is 4-colorable implies that $D^p$ is consistent w.r.t. IC (it suffices to follow the
same reasoning as the proof of Theorem 7, using 4 colors instead of 3).

We first prove that, if $G$ is 3-colorable, then the corresponding instance of mp is true. Let $f$ be a 3-coloring function
over $G$. Consider an interpretation $Pr$ for $D^p$ which assigns non-zero probability to the following possible worlds only:
$w_1 = \{R_G(n, f(n), e) \mid n \in N, e \in E \wedge n \in e\} \cup \{R_\phi(1)\}$
$w_2 = \{R_G(n, Next(f(n)), e) \mid n \in N, e \in E \wedge n \in e\}$
$w_3 = \{R_G(n, Next(Next(f(n))), e) \mid n \in N, e \in E \wedge n \in e\}$
$w_4 = \{R_G(n, Next(Next(Next(f(n)))), e) \mid n \in N, e \in E \wedge n \in e\}$
$w_5 = \{R_\phi(1), R_\phi(2)\}$
$w_6 = \{R_\phi(2)\}$

where $Next$ is a function which receives a color $c \in \{R, G, B, C\}$ and returns the next color in the sequence $R, G, B, C$
(where Next(C) returns R). Furthermore, Pr assigns probability | to the possible worlds w\,w 2 ,W3,W4 and w$, and 
probability | to the possible world we- It is easy to see that Pr is a model of D p w.r.t. IC and the probability that the 
tuple t - (1, 2) is an answer of Q assigned by Pr is 4, Hence, the mp is true in this case (as 4 < k\). 

We now prove that if $G$ is not 3-colorable, then the corresponding instance of mp is false. First observe that,
reasoning similarly to the proof of Theorem 7, it is possible to show that, for each model $Pr$ of $D^p$ w.r.t. IC and for
each possible world $w$ such that $Pr(w) > 0$, if $w$ contains at least a tuple of $r_G$, then for each node $n \in N$ and for each
edge $e \in E$ such that $n \in e$, there exists $c \in \{R, G, B, C\}$ such that $w$ contains the tuple $R_G(n, c, e)$. This is due to the
fact that every possible world $w$ such that $Pr(w) > 0$ cannot contain two tuples $R_G(n, c', e)$, $R_G(n, c'', e)$ and no tuple
in $r_G$ can belong to a possible world which contains the tuple $R_\phi(2)$ too.
Since $G$ is not 3-colorable, for each model $Pr$ of $D^p$ w.r.t. IC and for each possible world $w$ such that $Pr(w) > 0$
containing at least a tuple of $r_G$, it holds that $w$ contains a tuple $R_G(n, C, e)$. This implies that no possible world
containing a tuple of $r_G$ can contain the tuple $R_\phi(1)$, as otherwise $ic_4$ would be violated. Since $ic_1$ and $ic_3$ hold for
Pr, then the sum of the probability of the possible worlds containing at least a tuple of rc is equal to i . Since the 
possible worlds containing at least a tuple of rc cannot contain neither R$(l) nor ^(2) (as icn holds) and both R^il) 
nor/?^(2) has probability i it holds that the probability that both and R$(2) are true is j. The latter implies that 
the minimum probability that t = (1, 2) is an answer of Q is I, which is equal to k\. Therefore the mp is false if G is 
not 3-colorable. □ 

Appendix A.5. Proof of Theorem 12

Theorem 12 (qa complexity). qa belongs to $FP^{NP}$ and is $FP^{NP[\log n]}$-hard.

Proof. The membership in $FP^{NP}$ follows from [35], where it was shown that a problem more general than ours (that is,
the entailment problem for probabilistic logic programs with conditional rules) belongs to $FP^{NP}$ (see Related Work).
We prove the hardness for $FP^{NP[\log n]}$ by showing a reduction to qa from the well-known $FP^{NP[\log n]}$-hard problem
clique size, that is, the problem of determining the size $K^*$ of the largest clique of a given graph.

Let the graph $G = (N, E)$ be an instance of clique size, where $u_1, \dots, u_n$ are the nodes of $G$ (where $n = |N|$).
We construct an equivalent instance $(\mathcal{D}^p, IC, D^p, Q)$ of qa as follows. $\mathcal{D}^p$ is the database schema consisting of the
following relation schemas: $Node^p(Id, P)$, $NoEdge^p(nodeId_1, nodeId_2, P)$, $Flag^p(Id, P)$. The database instance $D^p$
consists of the following relation instances. Relation $node^p$ contains a tuple $t_i = Node^p(u_i, \frac{1}{n})$ for each node $u_i$ of $G$
(that is, every node of $G$ corresponds to a tuple of $node^p$ having probability $\frac{1}{n}$). Relation $noEdge^p$ contains a tuple
$NoEdge^p(u_i, u_j, 1)$ for each pair of distinct nodes of $G$ which are not connected by means of any edge in $E$ (thus,
$noEdge^p$ represents the complement of $E$, and all of its tuples have probability 1). Finally, relation $flag^p$ contains the
unique tuple $Flag^p(1, \frac{n-1}{n})$.

Let IC consist of the following denial constraints over $\mathcal{D}^p$:

$ic_1 : \neg[Node(x_1) \wedge Node(x_2) \wedge NoEdge(x_1, x_2)]$
$ic_2 : \neg[Node(x_1) \wedge Node(x_2) \wedge Flag(1) \wedge x_1 \neq x_2]$

Basically, constraint $ic_1$ forbids that tuples representing distinct nodes co-exist if they are not connected by any
edge, while $ic_2$ imposes that tuple $Flag(1)$ can co-exist with at most one tuple representing a node.

To complete the definition of the instance of qa, we define the (boolean) query $Q() = Flag(1) \wedge Node(x)$.

We will show that the size of the largest clique of $G$ is $K^*$ iff the empty tuple $t_\emptyset$ is an answer of $Q$ over $D^p$ with
minimum probability $l^* = \frac{n - K^*}{n}$ (i.e., $Ans(Q, D^p, IC)$ consists of the pair $\langle t_\emptyset, [p^{\min}, p^{\max}] \rangle$, with $p^{\min} = \frac{n - K^*}{n}$).

We first show that if $G$ contains a clique of size $K$, then $p^{\min} \le \frac{n - K}{n}$. In fact, if $K$ is the size of a clique $C$ of $G$, then
we can construct the following model $M$ for $D^p$ w.r.t. IC. Let $w^c = \{Node(u_i) \mid u_i \in C\} \cup noEdge$, $w^f = \{Flag(1)\} \cup
noEdge$, and, for each $u_i \in N \setminus C$, $w_i = \{Node(u_i), Flag(1)\} \cup noEdge$. Then, denoting as $w$ the generic possible world,
the model $M$ is defined as follows:

$$M(w) = \begin{cases} \frac{1}{n} & \text{if } w = w^c; \\ \frac{1}{n} & \text{if } w = w_i, \text{ for } i \text{ s.t. } u_i \in N \setminus C; \\ \frac{K-1}{n} & \text{if } w = w^f; \\ 0 & \text{otherwise.} \end{cases}$$



It is easy to see that $M$ is a model. First of all, it assigns non-zero probability only to possible worlds satisfying
the constraints. Moreover, for any tuple $t$ in $D^p$, summing the probabilities of the possible worlds containing $t$ results
in $t[P]$. In fact, considering only the possible worlds which have been assigned a non-zero probability by $M$, every
tuple $Node(u_i)$ representing a node $u_i \in C$ belongs only to $w^c$, which is assigned by $M$ the probability $\frac{1}{n}$ (the same as
$p(Node(u_i))$). Analogously, every tuple $Node(u_i)$ representing a node $u_i \notin C$ belongs only to $w_i$, which is assigned by
$M$ the probability $\frac{1}{n}$ (the same as $p(Node(u_i))$). Finally, tuple $Flag(1)$ occurs only in $w^f$ and in the $n - K$ possible worlds
of the form $w_i$, thus the sum of the probabilities of the possible worlds containing $Flag(1)$ is $M(w^f) + (n - K) \cdot \frac{1}{n} =
\frac{n-1}{n} = p(Flag(1))$.
It is easy to see that the probability of the answer $t_\emptyset$ of $Q$ over the model $M$ is the sum of the probabilities of the
possible worlds of the form $w_i$, that is $\frac{n - K}{n}$. Hence, from the definition of minimum probability, it holds that $p^{\min} \le \frac{n - K}{n}$.
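
The model $M$ built above can be checked numerically. The Python sketch below is a hedged illustration: the world encoding (frozensets of `("Node", u)` and `("Flag", 1)` pairs) and the function names are assumptions made here, and the $noEdge$ tuples are omitted because they have probability 1 and therefore occur in every world. The sketch does not verify that the given node set is actually a clique.

```python
from fractions import Fraction

FLAG = ("Flag", 1)


def clique_model(nodes, clique):
    """Build the model M for a clique C: world -> probability."""
    n, k = len(nodes), len(clique)
    model = {}
    # w^c: all node tuples of the clique, without Flag(1).
    model[frozenset(("Node", u) for u in clique)] = Fraction(1, n)
    # w_i: one world per node outside the clique, together with Flag(1).
    for u in nodes:
        if u not in clique:
            model[frozenset({("Node", u), FLAG})] = Fraction(1, n)
    # w^f: Flag(1) alone, with the remaining mass (K - 1) / n.
    if k > 1:
        model[frozenset({FLAG})] = Fraction(k - 1, n)
    return model


def marginal(model, t):
    """Sum of the probabilities of the worlds containing tuple t."""
    return sum(p for w, p in model.items() if t in w)


def answer_probability(model):
    """Probability that Flag(1) co-occurs with at least one node tuple."""
    return sum(
        p for w, p in model.items()
        if FLAG in w and any(kind == "Node" for kind, _ in w)
    )
```

With $n = 4$ nodes and a clique of size $K = 3$, the marginals come out as $1/4$ for every node tuple and $3/4$ for $Flag(1)$, and the answer probability is $(n - K)/n = 1/4$.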

To complete the proof, it suffices to show that the following property $\mathcal{P}$ holds over any model $M'$ for $D^p$ w.r.t.
IC: "the probability $l$ of the answer $t_\emptyset$ of $Q$ over $M'$ cannot be strictly less than $l^* = \frac{n - K^*}{n}$". Observe that, for
every model $M'$, the possible worlds which have been assigned a non-zero probability by $M'$ can be of three types
(we do not consider $noEdge$ tuples, as they have probability 1, thus they belong to every non-zero-probability possible
world):

Type 1: world not containing $Flag(1)$, and containing a non-empty set of tuples representing the nodes of a clique
(the non-emptiness of this set derives from the combination of constraint $ic_2$ with the value of the marginal
probability assigned to tuple $Flag(1)$);

Type 2: world containing the tuple $Flag(1)$ and exactly one node tuple;

Type 3: world containing the tuple $Flag(1)$ only.

We will show that property $\mathcal{P}$ holds over any model $M'$ by reasoning inductively on the number $x$ of possible worlds
of Type 1 which have been assigned a non-zero probability by $M'$.

The base case is $x = 1$, meaning that $M'$ assigns probability $\frac{1}{n}$ to a unique Type-1 world $w_1^{T_1}$, and probability 0 to
all the other possible worlds of the same type. It is easy to see that the sum of the probabilities assigned by $M'$ to the
Type-2 worlds (which coincides with $l$) is equal to $\frac{1}{n} \cdot (n - |C_1|)$, where $C_1$ is the clique represented by $w_1^{T_1}$. Hence,
if it were $l < l^*$, it would hold that $\frac{1}{n} \cdot (n - |C_1|) < \frac{n - K^*}{n}$, which means that $|C_1| > K^*$, thus contradicting that $K^*$ is
the size of the maximum clique of $G$.

We now prove the induction step. The induction hypothesis is that $\mathcal{P}$ holds over any model assigning non-zero
probability to exactly $x - 1$ Type-1 possible worlds (with $x - 1 \ge 1$). We show that this implies that $\mathcal{P}$ holds also over
any model assigning non-zero probability to exactly $x$ Type-1 possible worlds. Consider a model $M'$ assigning non-zero
probability to exactly $x$ Type-1 possible worlds, namely $w_1^{T_1}, \dots, w_x^{T_1}$. We assume that these worlds are ordered
by their cardinality (in descending order), and denote as $C_i$ the clique represented by $w_i^{T_1}$ (with $i \in [1..x]$). We also
denote as $w_1^{T_2}, \dots, w_n^{T_2}$ the Type-2 possible worlds (where $w_i^{T_2}$ contains the node tuple representing $u_i$). Moreover, let
$l'$ be the probability of the answer $t_\emptyset$ of $Q$ over $M'$. We show that, starting from $M'$, a new model $M''$ for $D^p$ w.r.t. IC
can be constructed such that:

i) $M''$ assigns non-zero probability to $x - 1$ Type-1 possible worlds;
ii) the probability $l''$ of the answer true of $Q$ over $M''$ satisfies $l'' \le l'$.

Specifically, $M''$ is defined as follows. $M''$ coincides with $M'$ on all the Type-1 worlds except for the probabilities
assigned to $w_1^{T_1}$ and $w_x^{T_1}$. In particular, $M''(w_1^{T_1}) = M'(w_1^{T_1}) + M'(w_x^{T_1})$, while $M''(w_x^{T_1}) = 0$. Moreover, for each
Type-2 world $w_i^{T_2}$ such that $u_i \in C_1 \setminus C_x$, $M''(w_i^{T_2}) = M'(w_i^{T_2}) - M'(w_x^{T_1})$, and, for each Type-2 world $w_i^{T_2}$ such that
$u_i \in C_x \setminus C_1$, $M''(w_i^{T_2}) = M'(w_i^{T_2}) + M'(w_x^{T_1})$. On the remaining Type-2 worlds, $M''$ is set equal to $M'$. Finally,
denoting the Type-3 world as $w^{T_3}$, $M''(w^{T_3}) = M'(w^{T_3}) - |C_x \setminus C_1| \cdot M'(w_x^{T_1}) + |C_1 \setminus C_x| \cdot M'(w_x^{T_1})$. In brief, $M''$ is
obtained from $M'$ by moving the probability assigned to $w_x^{T_1}$ to $w_1^{T_1}$, and re-assigning the probabilities of the Type-2
and Type-3 worlds accordingly. Hence, it is easy to see that $M''$ is still a model (as it can be easily checked that
it makes the sum of the probabilities of the possible worlds containing a tuple equal to the marginal probability of
the tuple). Moreover, property i) holds, as $M''$ assigns probability 0 to the world $w_x^{T_1}$, while the other worlds of
the form $w_i^{T_1}$ (with $i < x$) are still assigned by $M''$ a positive probability, and the remaining Type-1 worlds are still
assigned probability 0. Also property ii) holds, since the probability of true as answer of $Q$ over $M''$ is given by
$l'' = l' + |C_x \setminus C_1| \cdot M'(w_x^{T_1}) - |C_1 \setminus C_x| \cdot M'(w_x^{T_1})$. Since $|C_1| \ge |C_x|$, and thus $|C_1 \setminus C_x| \ge |C_x \setminus C_1|$, $l''$ is less than or equal
to $l'$. If it were $l' < l^*$ (and thus $l'' < l^*$), $M''$ would be a model assigning non-zero probability to $x - 1$ Type-1 possible
worlds such that the answer true of $Q$ over $M''$ has probability strictly less than $l^*$, thus contradicting the induction
hypothesis. □

Appendix A.6. Proof of Theorem 13

The proof of Theorem 13 is postponed to the end of this section, after introducing some preliminary lemmas.

Lemma 4. Let $D^p$ be a PDB instance of $\mathcal{D}^p$ such that $HG(D^p, IC)$ is a graph and $D^p \models IC$. Let $t$, $t'$ be two tuples
connected by exactly one path in $HG(D^p, IC)$. Then, $p^{\min}(t \wedge t')$ and $p^{\max}(t \wedge t')$ can be computed in polynomial time
w.r.t. the size of $D^p$.
Proof. Let $\pi$ be the path connecting $t$ and $t'$ in $HG(D^p, IC)$. It is easy to see that the fact that $\pi$ is unique implies that
$p^{\min}(t \wedge t') = p_\pi^{\min}(t \wedge t')$ and $p^{\max}(t \wedge t') = p_\pi^{\max}(t \wedge t')$ (in fact, any model for $D^p$ w.r.t. $HG(D^p, IC)$ can be obtained by
refining a model for $D^p$ w.r.t. $\pi$ without changing the probabilities assigned to the event $t \wedge t'$, following a reasoning
analogous to that used in the proof of the right-to-left implication of Theorem 2).

Since the path $\pi$ connecting $t$ and $t'$ in the graph $HG(D^p, IC)$ is unique, it does not contain cycles (otherwise there
would be at least two paths between $t$ and $t'$). Hence, $\pi$ is a chain in a graph (the definition of chain for hypergraphs
is introduced in Appendix A.2). Therefore, $p_\pi^{\min}(t \wedge t')$ can be determined by exploiting Lemma 3, which
provides the formula for computing the minimum probability that the ears at the endpoints of a chain co-exist. It
is trivial to see that, denoting as $\hat{t}$ and $\hat{t}'$ the tuples connected to $t$ and $t'$ in $\pi$, in our case the formula in Lemma 3
becomes:

$$p_\pi^{\min}(t \wedge t') = \begin{cases} 0, & \text{if } (t, t') \text{ is an edge of } \pi \\ \max\{0,\ p(t) + p(t') - [1 - p_\pi^{\min}(\hat{t} \wedge \hat{t}')]\}, & \text{otherwise,} \end{cases}$$

since $\pi$ is a chain in a graph, thus its intermediate edges are hyperedges of cardinality 2 with no ears.
As regards $p_\pi^{\max}(t \wedge t')$, it can be evaluated as follows:

$$p_\pi^{\max}(t \wedge t') = \begin{cases} 0, & \text{if } (t, t') \text{ is an edge of } \pi \\ \min\{p(t),\ p(t'),\ 1 - [p(\hat{t}) + p(\hat{t}') - p_\pi^{\max}(\hat{t} \wedge \hat{t}')]\}, & \text{otherwise.} \end{cases}$$

In fact, it is easy to see that the maximum probability of the event $t \wedge t'$ is $\min\{p(t), p(t'), p_\pi^{\max}(\neg\hat{t} \wedge \neg\hat{t}')\}$, where
$p_\pi^{\max}(\neg\hat{t} \wedge \neg\hat{t}')$ is the maximum probability that both the tuples $\hat{t}$ and $\hat{t}'$ (which are mutually exclusive with $t$ and $t'$,
respectively) are false. In turn, $p_\pi^{\max}(\neg\hat{t} \wedge \neg\hat{t}') = 1 - p_\pi^{\min}(\hat{t} \vee \hat{t}') = 1 - [p(\hat{t}) + p(\hat{t}') - p_\pi^{\max}(\hat{t} \wedge \hat{t}')]$, thus proving the
above-reported formula.

We complete the proof by observing that $p^{\min}(t \wedge t')$ and $p^{\max}(t \wedge t')$ can be computed in polynomial time w.r.t.
the size of $D^p$ by recursively applying the above-reported formulas for $p^{\min}$ and $p^{\max}$ starting from $t$ and $t'$, and going
further on towards the center of the unique path connecting $t$ and $t'$. □
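
The recursion described in the proof of Lemma 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' algorithm: the path is given as the list of marginal probabilities of its tuples, from $t$ to $t'$; the function name `pair_bounds` is hypothetical; and the base case for a path with a single inner tuple (where $\hat{t}$ and $\hat{t}'$ coincide, so that their conjunction has the probability of that tuple) is an assumption added here, since the text only states the case in which $(t, t')$ is an edge.

```python
from fractions import Fraction


def pair_bounds(probs):
    """Return (p_min, p_max) of the event t AND t' for a chain.

    `probs` lists the marginals of the tuples along the unique path,
    with probs[0] = p(t) and probs[-1] = p(t').
    """
    if len(probs) == 1:
        # Assumed base case: t-hat and t-hat' are the same tuple.
        return probs[0], probs[0]
    if len(probs) == 2:
        # (t, t') is an edge: the two tuples can never co-exist.
        return 0, 0
    # Recurse on the inner path between t-hat and t-hat'.
    inner_min, inner_max = pair_bounds(probs[1:-1])
    p_t, p_t2 = probs[0], probs[-1]
    p_hat, p_hat2 = probs[1], probs[-2]
    p_min = max(0, p_t + p_t2 - (1 - inner_min))
    p_max = min(p_t, p_t2, 1 - (p_hat + p_hat2 - inner_max))
    return p_min, p_max
```

Each recursive call shortens the path by two tuples, so the number of calls is linear in the path length. For exact arithmetic the marginals are best passed as `Fraction` values.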

Lemma 5. For projection-free queries, qa is in PTIME if $HG(D^p, IC)$ is a clique.

Proof. It straightforwardly follows from the fact that, for each pair of tuples $t$, $t'$ in $HG(D^p, IC)$, it holds that $p^{\min}(t \wedge
t') = p^{\max}(t \wedge t') = 0$. □

Lemma 6. For projection-free queries, qa is in PTIME if $HG(D^p, IC)$ is a tree.

Proof. $Ans(Q, D^p, IC)$ can be determined by first evaluating the answer $r_q$ of $Q$ w.r.t. $det(D^p)$, and then computing,
for each $t \in r_q$, the minimum and maximum probabilities $p^{\min}$ and $p^{\max}$ of $t$ as answer of $Q$. Obviously, $r_q$ can be
evaluated in polynomial time w.r.t. the size of $D^p$, and the number of tuples in $r_q$ is polynomially bounded by the size
of $D^p$.

Observe that every ground tuple $t \in r_q$ derives from the conjunction of a set of tuples $\{t_1, \dots, t_n\}$ in $det(D^p)$.
Thus, in order to prove the statement, it suffices to prove that, for each set $\{t_1, \dots, t_n\}$ of tuples in $det(D^p)$, computing
$p^{\min}(t_1 \wedge \dots \wedge t_n)$ and $p^{\max}(t_1 \wedge \dots \wedge t_n)$ is feasible in polynomial time w.r.t. the size of $D^p$.

For the sake of clarity of presentation, we assume that $HG(D^p, IC)$ coincides with its own minimal spanning tree
containing all the tuples in $\{t_1, \dots, t_n\}$. This means that each $t_i$ (with $i \in [1..n]$) is either a leaf node or occurs as
intermediate node in the path connecting two other tuples in $\{t_1, \dots, t_n\}$, and all the leaf nodes are in $\{t_1, \dots, t_n\}$. In
fact, if this were not the case, it is straightforward to see that nothing would change in evaluating $p^{\min}(t_1 \wedge \dots \wedge t_n)$
and $p^{\max}(t_1 \wedge \dots \wedge t_n)$ if we disregarded the nodes of $HG(D^p, IC)$ which are not in $\{t_1, \dots, t_n\}$ and do not belong to
any path connecting some pair of nodes in $\{t_1, \dots, t_n\}$.

Before showing how $p^{\min}(t_1 \wedge \cdots \wedge t_n)$ and $p^{\max}(t_1 \wedge \cdots \wedge t_n)$ can be computed, we introduce some notation. 
We say that a tuple $t$ is a branching node of $HG(D^p, IC)$ iff the degree of $t$ is greater than two. Moreover, a pair of 
tuples $(t_i, t_j)$ is said to be an elementary pair of tuples of $HG(D^p, IC)$ if (i) each of $t_i$ and $t_j$ is either in $\{t_1, \ldots, t_n\}$ 
or a branching node, and (ii) the path connecting $t_i$ to $t_j$ contains neither branching nodes nor tuples in $\{t_1, \ldots, t_n\}$ as 
intermediate nodes. 
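
The two definitions above can be illustrated with a minimal Python sketch. It assumes the conflict tree is given as an adjacency dictionary and has already been pruned as described earlier, so that every leaf is one of the query tuples; the function names and the example tree are hypothetical and are not part of the original proof.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, Set

Node = Hashable


def branching_nodes(adj: Dict[Node, Set[Node]]) -> Set[Node]:
    """Return the nodes whose degree is greater than two."""
    return {v for v, nbrs in adj.items() if len(nbrs) > 2}


def elementary_pairs(adj: Dict[Node, Set[Node]],
                     query_tuples: Iterable[Node]) -> Set[FrozenSet[Node]]:
    """Return the elementary pairs of a pruned conflict tree.

    Assumption: every leaf of the tree is a query tuple, so each walk below
    is guaranteed to end at a query tuple or at a branching node.
    """
    special = set(query_tuples) | branching_nodes(adj)
    pairs: Set[FrozenSet[Node]] = set()
    for start in special:
        for first in adj[start]:
            prev, cur = start, first
            # Intermediate nodes are neither query tuples nor branching nodes,
            # hence they have degree exactly two and the walk can continue.
            while cur not in special:
                nxt = next(v for v in adj[cur] if v != prev)
                prev, cur = cur, nxt
            pairs.add(frozenset((start, cur)))
    return pairs


# Hypothetical tree: a - x - b - y - c, plus the edge b - d.
adj = {
    "a": {"x"}, "x": {"a", "b"}, "b": {"x", "y", "d"},
    "y": {"b", "c"}, "c": {"y"}, "d": {"b"},
}
pairs = elementary_pairs(adj, {"a", "c", "d"})
```

In this example the only branching node is `b`, and the elementary pairs are (a, b), (b, c) and (b, d); the nodes `x` and `y` occur only as intermediate nodes.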

The set of the elementary pairs of tuples is denoted as $EP_{HG(D^p, IC)}$ (we also use the short notation $EP$, when 
$HG(D^p, IC)$ is understood). Moreover, we denote the branching nodes of $HG(D^p, IC)$ which are not in $\{t_1, \ldots, t_n\}$ as 
$t_{n+1}, \ldots, t_{n+m}$. Observe that $m < n$, as $n$ is also greater than or equal to the number of leaves of $HG(D^p, IC)$. Finally, 
we denote with $\mathbb{B} = \{true, false\}$ the boolean domain, with $\mathbb{B}^{n+m}$ the set of all the tuples of $n + m$ boolean values, and 
use the symbol $\alpha$ for tuples of $n + m$ boolean values and the notation $\alpha[i]$ to indicate the value of the $i$-th attribute of $\alpha$. 

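We will show that $p^{\min}(t_1 \wedge \cdots \wedge t_n)$ (resp., $p^{\max}(t_1 \wedge \cdots \wedge t_n)$) is a solution of the following linear programming 
problem instance $LP(t_1 \wedge \cdots \wedge t_n, \mathcal{D}^p, IC, D^p)$: 

$$\text{minimize (resp., maximize)} \sum_{\alpha \in \mathbb{B}^{n+m} \,\mid\, \forall i \in [1..n]\ \alpha[i] = true} x_\alpha$$

subject to $S(t_1 \wedge \cdots \wedge t_n, \mathcal{D}^p, IC, D^p)$ 

where $S(t_1 \wedge \cdots \wedge t_n, \mathcal{D}^p, IC, D^p)$ is the following system of linear inequalities: 

$$\begin{array}{lll}
\forall (t_i, t_j) \in EP & x_{t_i,t_j} = \sum_{\alpha \in \mathbb{B}^{n+m} \,\mid\, \alpha[i] = true \,\wedge\, \alpha[j] = true} x_\alpha & (A) \\
\forall (t_i, t_j) \in EP & p^{\min}(t_i \wedge t_j) \le x_{t_i,t_j} \le p^{\max}(t_i \wedge t_j) & (B) \\
\forall i \in [1..n+m] & \sum_{\alpha \in \mathbb{B}^{n+m} \,\mid\, \alpha[i] = true} x_\alpha = p(t_i) & (C) \\
 & \sum_{\alpha \in \mathbb{B}^{n+m}} x_\alpha = 1 & (D)
\end{array}$$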
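
To make the role of the constraints (A)-(D) concrete, the following minimal Python sketch checks whether a candidate assignment of the variables $x_\alpha$ satisfies the system, and evaluates the objective function. It is not a solver and is not taken from the paper: the function names are hypothetical, positions are 0-based, constraint (A) is substituted directly into (B), and non-negativity of the $x_\alpha$ is an added assumption (they are meant to be probabilities).

```python
from itertools import product


def sum_where(x, positions):
    """Sum x[alpha] over the assignments alpha that are true at every given position."""
    return sum(v for alpha, v in x.items() if all(alpha[i] for i in positions))


def objective(x, n):
    """Objective value: total mass of the assignments where the first n tuples are all true."""
    return sum_where(x, range(n))


def satisfies_system(x, ep_bounds, marginals, tol=1e-9):
    """Check constraints (A)-(D) for a candidate x.

    x         : dict mapping each boolean tuple alpha to the value of x_alpha
    ep_bounds : dict mapping an elementary pair (i, j) to (p_min, p_max)
    marginals : list with the marginal probability p(t_i) of each position i
    """
    if any(v < -tol for v in x.values()):          # assumed: x_alpha >= 0
        return False
    if abs(sum(x.values()) - 1.0) > tol:           # (D)
        return False
    for i, p in enumerate(marginals):              # (C)
        if abs(sum_where(x, [i]) - p) > tol:
            return False
    for (i, j), (lo, hi) in ep_bounds.items():     # (A) substituted into (B)
        x_ij = sum_where(x, [i, j])
        if not (lo - tol <= x_ij <= hi + tol):
            return False
    return True


# Hypothetical instance: two conflicting tuples, so p_min = p_max = 0 for the pair.
x = {alpha: 0.0 for alpha in product((True, False), repeat=2)}
x[(True, False)] = 0.3
x[(False, True)] = 0.5
x[(False, False)] = 0.2
ep_bounds = {(0, 1): (0.0, 0.0)}
marginals = [0.3, 0.5]
feasible = satisfies_system(x, ep_bounds, marginals)
```

In this toy instance the candidate is feasible and the objective is 0, which matches the intuition that two conflicting tuples can never coexist. An actual implementation would pass the same rows to a linear programming solver.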



Therein: (i) $x_{t_i,t_j}$ is a variable representing the probability that $t_i$ and $t_j$ coexist; and (ii) $\forall \alpha \in \mathbb{B}^{n+m}$, $x_\alpha$ is a variable 
representing the probability that $\forall i \in [1..n+m]$ the truth value of $t_i$ is $\alpha[i]$; that is, $x_\alpha$ is the probability of the event 

$$\bigwedge_{i \,\mid\, \alpha[i] = true} t_i \;\wedge \bigwedge_{i \,\mid\, \alpha[i] = false} \neg t_i.$$

Since $HG(D^p, IC)$ is a tree, Lemma 4 ensures that, for each $(t_i, t_j) \in EP$, $p^{\min}(t_i \wedge t_j)$ and $p^{\max}(t_i \wedge t_j)$ can be 
computed in polynomial time w.r.t. the size of $D^p$. Therefore, we assume that they are precomputed constants in 
$LP(t_1 \wedge \cdots \wedge t_n, \mathcal{D}^p, IC, D^p)$. 

It is easy to see that $LP(t_1 \wedge \cdots \wedge t_n, \mathcal{D}^p, IC, D^p)$ can be solved in polynomial time w.r.t. the size of $D^p$, as it 
consists of at most $6n - 2$ inequalities using $2^{2n-1} + 2n - 1$ variables, and $n$ only depends on the number of relations 
appearing in $Q$ (we recall that we are addressing data complexity, thus queries are of constant arity). 

We now show that, for each solution of $S(t_1 \wedge \cdots \wedge t_n, \mathcal{D}^p, IC, D^p)$, there is a model $Pr$ of $D^p$ w.r.t. $IC$ such that 
$p(t_1 \wedge \cdots \wedge t_n)$ w.r.t. $Pr$ is equal to $\sum_{\alpha \in \mathbb{B}^{n+m} \,\mid\, \forall i \in [1..n]\ \alpha[i] = true} x_\alpha$, and vice versa. 

Given a solution $\sigma$ of $S(t_1 \wedge \cdots \wedge t_n, \mathcal{D}^p, IC, D^p)$, for each $\alpha \in \mathbb{B}^{n+m}$ we denote with $\sigma_\alpha$ the value assumed by the 
variable $x_\alpha$ in $\sigma$; moreover, for each $(t_i, t_j) \in EP$ we denote with $\sigma_{t_i,t_j}$ the value assumed by the variable $x_{t_i,t_j}$ in $\sigma$. 

For each $(t_i, t_j) \in EP$, we denote with $D^p_{t_i,t_j}$ the maximal subset of $D^p$ which contains only $t_i$, $t_j$, and the tuples 
along the path connecting $t_i$ and $t_j$. 

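From Proposition 2, the fact that, for each $(t_i, t_j) \in EP$, the value $\sigma_{t_i,t_j}$ is such that $p^{\min}(t_i \wedge t_j) \le \sigma_{t_i,t_j} \le p^{\max}(t_i \wedge t_j)$ 
implies that there is at least a model $Pr_{t_i,t_j}$ of $D^p_{t_i,t_j}$ w.r.t. $IC$ such that $p(t_i \wedge t_j)$ w.r.t. $Pr_{t_i,t_j}$ is equal to $\sigma_{t_i,t_j}$. For each 
$(t_i, t_j) \in EP$, we consider a model $Pr_{t_i,t_j}$ of $D^p_{t_i,t_j}$ w.r.t. $IC$ such that $p(t_i \wedge t_j)$ w.r.t. $Pr_{t_i,t_j}$ is equal to $\sigma_{t_i,t_j}$. Moreover, 
for each possible world $w \in pwd(D^p_{t_i,t_j})$, we define the relative weight of $w$ (and denote it by $wr(w)$) as: 

$$wr(w) = \begin{cases}
\dfrac{Pr_{t_i,t_j}(w)}{\sum_{w' \in pwd(D^p_{t_i,t_j}) \,\mid\, t_i \in w' \wedge t_j \in w'} Pr_{t_i,t_j}(w')} & \text{if } t_i \in w \wedge t_j \in w \\[2ex]
\dfrac{Pr_{t_i,t_j}(w)}{\sum_{w' \in pwd(D^p_{t_i,t_j}) \,\mid\, t_i \in w' \wedge t_j \notin w'} Pr_{t_i,t_j}(w')} & \text{if } t_i \in w \wedge t_j \notin w \\[2ex]
\dfrac{Pr_{t_i,t_j}(w)}{\sum_{w' \in pwd(D^p_{t_i,t_j}) \,\mid\, t_i \notin w' \wedge t_j \in w'} Pr_{t_i,t_j}(w')} & \text{if } t_i \notin w \wedge t_j \in w \\[2ex]
\dfrac{Pr_{t_i,t_j}(w)}{\sum_{w' \in pwd(D^p_{t_i,t_j}) \,\mid\, t_i \notin w' \wedge t_j \notin w'} Pr_{t_i,t_j}(w')} & \text{if } t_i \notin w \wedge t_j \notin w
\end{cases}$$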
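
In other words, $wr(w)$ normalizes the probability of $w$ within the class of local worlds that agree with $w$ on the presence of $t_i$ and $t_j$. A minimal Python sketch of this normalization follows; the representation of a local model as a dictionary from worlds (frozensets of tuple names) to probabilities, the function name, and the convention of returning 0 for a class with zero total mass are all assumptions made for illustration.

```python
def relative_weights(local_model, ti, tj):
    """Compute wr(w) for every world w of a local model.

    local_model : dict mapping each world (a frozenset of tuple names) to its probability
    ti, tj      : the two end tuples of the elementary pair
    """
    # Total mass of each of the four presence classes of (ti, tj).
    totals = {}
    for world, pr in local_model.items():
        key = (ti in world, tj in world)
        totals[key] = totals.get(key, 0.0) + pr

    wr = {}
    for world, pr in local_model.items():
        key = (ti in world, tj in world)
        # Assumed convention: a class with zero mass gets weight 0 instead of 0/0.
        wr[world] = pr / totals[key] if totals[key] > 0 else 0.0
    return wr


# Hypothetical local model on the chain ti - u - tj (adjacent tuples conflict).
local_model = {
    frozenset({"ti", "tj"}): 0.25,
    frozenset({"ti"}): 0.25,
    frozenset({"tj"}): 0.25,
    frozenset({"u"}): 0.125,
    frozenset(): 0.125,
}
wr = relative_weights(local_model, "ti", "tj")
```

Here the class where neither end tuple is present contains two worlds of equal probability, so each receives relative weight 0.5, while every other class contains a single world with relative weight 1.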
It is easy to see that, for each possible world $w \in pwd(D^p)$, there is for each pair $(t_i, t_j) \in EP$ a possible world 
$w_{t_i,t_j} \in pwd(D^p_{t_i,t_j})$ such that $w = \bigcup_{(t_i,t_j) \in EP} w_{t_i,t_j}$, and vice versa. 

We consider the interpretation $Pr$ of $D^p$ defined as follows. For each possible world $w \in pwd(D^p)$, we consider 
the possible worlds $w_{t_i,t_j}$ such that $w = \bigcup_{(t_i,t_j) \in EP} w_{t_i,t_j}$ and define the interpretation $Pr$ of $D^p$ as: 

$$Pr(w) = \sigma_\alpha \cdot \prod_{(t_i,t_j) \in EP} wr(w_{t_i,t_j}),$$

where $\alpha$ is the tuple in $\mathbb{B}^{n+m}$ which agrees with $w$ on the presence/absence of $t_1, \ldots, t_{n+m}$ (i.e., $\forall i \in [1..n+m]$, $\alpha[i] = true$ 
(resp., $false$) iff $t_i \in w$ (resp., $t_i \notin w$)). It is easy to see that $Pr$ is a model for $D^p$ w.r.t. $IC$. Specifically, the following 
conditions hold: 

• $Pr$ assigns probability 0 to every possible world $w$ not satisfying $IC$. This can be proved reasoning by 
contradiction. Assume that $Pr(w) > 0$ and $w$ does not satisfy $IC$. Consider the possible worlds $w_{t_i,t_j}$ such that 

$$w = \bigcup_{(t_i,t_j) \in EP} w_{t_i,t_j}.$$

Since $Pr(w) > 0$, for each $(t_i, t_j) \in EP$ it holds that 

$$Pr_{t_i,t_j}(w_{t_i,t_j}) > 0.$$

Hence, since $Pr_{t_i,t_j}$ is a model of $D^p_{t_i,t_j}$, $w_{t_i,t_j}$ contains no pair of tuples $t', t''$ connected by an edge in 
$HG(D^p, IC)$. Therefore, $w$ contains no pair of tuples $t', t''$ connected by an edge in $HG(D^p, IC)$, thus 
contradicting that $w$ does not satisfy $IC$. 

• For each tuple $t \in D^p$, $p(t) = \sum_{w \in pwd(D^p) \,\mid\, t \in w} Pr(w)$. This follows from the fact that, given a tuple $t \in D^p$ 
such that $t$ belongs to a chain whose ends are the tuples $t_i$, $t_j$, the probability of $t$ is given by 

$$\sum_{w_{t_i,t_j} \in pwd(D^p_{t_i,t_j}) \,\mid\, t \in w_{t_i,t_j}} Pr_{t_i,t_j}(w_{t_i,t_j}).$$

The latter is equal to $\sum_{w \in pwd(D^p) \text{ s.t. } t \in w} Pr(w)$, since for each $w_{t_i,t_j} \in pwd(D^p_{t_i,t_j})$ it holds that 

$$\sum_{w \in pwd(D^p) \text{ s.t. } w_{t_i,t_j} \subseteq w} Pr(w) = Pr_{t_i,t_j}(w_{t_i,t_j}).$$
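
A further property, implicit in calling $Pr$ an interpretation, is that the values $Pr(w)$ sum to 1. The following is only a sketch of why this holds. It assumes that the relative weights within each presence class of an elementary pair are well defined and sum to 1, and it writes $W_\alpha$ (a notation introduced here only for this check) for the set of possible worlds agreeing with $\alpha$ on $t_1, \ldots, t_{n+m}$. Since the intermediate nodes of distinct elementary pairs are disjoint, the worlds in $W_\alpha$ correspond to the combinations of local worlds that agree with $\alpha$ on their end tuples, hence:

$$\sum_{w \in W_\alpha} Pr(w) = \sigma_\alpha \cdot \prod_{(t_i,t_j) \in EP} \Big( \sum_{w_{t_i,t_j} \text{ agreeing with } \alpha \text{ on } t_i, t_j} wr(w_{t_i,t_j}) \Big) = \sigma_\alpha, \qquad \sum_{w \in pwd(D^p)} Pr(w) = \sum_{\alpha \in \mathbb{B}^{n+m}} \sigma_\alpha = 1,$$

where the last equality is the normalization constraint (D) of the system.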

Therefore, the interpretation $Pr$ is a model for $D^p$ w.r.t. $IC$, and the probability assigned to $t_1 \wedge \cdots \wedge t_n$ by $Pr$ is 
equal to $\sum_{\alpha \in \mathbb{B}^{n+m} \,\mid\, \forall i \in [1..n]\ \alpha[i] = true} \sigma_\alpha$. Hence, it is easy to see that $p^{\min}(t_1 \wedge \cdots \wedge t_n)$ (resp., $p^{\max}(t_1 \wedge \cdots \wedge t_n)$) is the optimal 
solution of $LP(t_1 \wedge \cdots \wedge t_n, \mathcal{D}^p, IC, D^p)$ and can be computed in polynomial time w.r.t. the size of $D^p$, which completes 
the proof. □ 

Theorem 13. For projection-free queries, qa is in PTIME if $HG(D^p, IC)$ is a simple graph. 

Proof. Let $t$ be an answer of the projection-free query $Q$ posed on the deterministic version of $D^p$. The minimum and 
maximum probabilities $p^{\min}$ and $p^{\max}$ of $t$ as answer of $Q$ over $D^p$ can be determined as follows. Let $T = \{t_1, \ldots, t_n\}$ 
be the set of tuples in $D^p$ such that $Q(t) = t_1 \wedge \cdots \wedge t_n$. $T$ can be partitioned into the sets $T_1, \ldots, T_k$, such that: 

1) $k$ is the number of distinct (maximal) connected components of $HG(D^p, IC)$ that contain at least one 
tuple in $T$; 

2) for each $i \in [1..k]$, $T_i$ contains the tuples of $T$ belonging to the $i$-th maximal connected component of $HG(D^p, IC)$ 
among those mentioned in 1). 

Let $\bar{t}_i$ be the conjunction of the tuples belonging to the partition $T_i$ of $T$. Since every maximal connected component 
of $HG(D^p, IC)$ is either a clique or a tree, Lemmas 5 and 6 ensure that $p^{\min}(\bar{t}_i)$ and $p^{\max}(\bar{t}_i)$ can be computed in 
polynomial time w.r.t. the size of $D^p$. As distinct $\bar{t}_i$ and $\bar{t}_j$, with $i, j \in [1..k]$, belong to distinct maximal 
connected components of $HG(D^p, IC)$, they can be viewed as events among which no correlation is known. Hence, 
$p^{\min}(t)$ (resp., $p^{\max}(t)$) can be determined by applying Fact 2 to the events $\bar{t}_1, \ldots, \bar{t}_k$, with the probability of $\bar{t}_i$ equal to 
$p(\bar{t}_i) = p^{\min}(\bar{t}_i)$ (resp., $p(\bar{t}_i) = p^{\max}(\bar{t}_i)$), for each $i \in [1..k]$. □ 
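
The final combination step can be sketched in Python as follows. Fact 2 is not restated in this appendix, so the sketch assumes that it provides the standard bounds for the conjunction of events among which no correlation is known: the lower bound is $\max(0, \sum_i p_i - (k - 1))$ and the upper bound is $\min_i p_i$. The function name and the input values are hypothetical.

```python
def combine_components(p_min, p_max):
    """Combine per-component bounds into bounds for the whole conjunction.

    p_min[i], p_max[i] : minimum and maximum probability of the conjunction of
                         the query tuples falling in the i-th connected component.
    Assumption: events of distinct components have no known correlation, so the
    standard bounds for a conjunction of k events apply.
    """
    k = len(p_min)
    lower = max(0.0, sum(p_min) - (k - 1))  # events overlap as little as possible
    upper = min(p_max)                      # events overlap as much as possible
    return lower, upper


# Hypothetical per-component values for k = 2 components.
bounds = combine_components([0.75, 0.5], [1.0, 0.75])
```

With these inputs the conjunction has probability between 0.25 and 0.75.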
Appendix A.7. Extending tractable cases of query evaluation 

As discussed in the core of the paper (Section 6), our tractability result on query evaluation can be extended to the 
cases that: i) tuples are associated with ranges of probabilities, instead of exact probability values; ii) denial constraints 
are probabilistic. We here give a hint on how the proof of Lemma 6 can be extended to these cases (Lemma 6 states 
that projection-free queries can be evaluated in PTIME if the conflict hypergraph is a tree, and is the core of the proof 
of Theorem 13). 

As regards extension ii), it is easy to see that, as shown for cc, any instance $I$ of the query evaluation problem in 
the presence of probabilistic constraints is equivalent to an instance $I'$ of qa, where the conflict hypergraph $H'$ of $I'$ 
is obtained by augmenting each hyperedge of the conflict hypergraph $H$ of $I$ with an ear. The point is that, even if $H$ 
is a tree, this reduction makes $H'$ contain hyperedges with more than two nodes, thus $H'$ is no longer a tree. However, 
$H'$ is a hypertree of a particular form: for any pair of intersecting hyperedges, their intersection consists of a unique node, 
which is a node inherited from $H$ (the new nodes of $H'$ are all ears). This implies that the minimum and maximum 
probabilities $p^{\min}$ and $p^{\max}$ of an answer can still be computed as solutions of the two variants of the optimization 
problem $LP(t_1 \wedge \cdots \wedge t_n, \mathcal{D}^p, IC, D^p)$ introduced in the proof of Lemma 6. The fact that $LP(t_1 \wedge \cdots \wedge t_n, \mathcal{D}^p, IC, D^p)$ 
can still be written and solved in polynomial time derives from the fact that the values $p^{\min}(t_i \wedge t_j)$ and $p^{\max}(t_i \wedge t_j)$ 
occurring in the inequalities (B) can still be evaluated in polynomial time, by observing that both $p^{\min}(t_i \wedge t_j)$ and 
$p^{\max}(t_i \wedge t_j)$ can be obtained by exploiting Lemma 3 for the minimum probability value, and an analogous result for 
the maximum probability value. Observe that this reasoning does not work (as is) for general hypertrees, as in this 
case we are not assured that the tuples composing the answer are in intersections between distinct pairs of hyperedges. 
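
The structural property used in this argument can be illustrated with a minimal Python sketch. It only models the shape of the hypergraph: how the probability attached to each ear is derived from the probabilistic constraint is not covered here. The function names and the `("ear", k)` naming of the new nodes are hypothetical.

```python
from itertools import combinations


def augment_with_ears(edges):
    """Extend every edge with one fresh node (its ear), yielding hyperedges."""
    return [frozenset(e) | {("ear", k)} for k, e in enumerate(edges)]


def is_ear(node):
    """Recognize the fresh nodes created by augment_with_ears (assumed naming)."""
    return isinstance(node, tuple) and len(node) == 2 and node[0] == "ear"


def intersections_are_single_inherited_nodes(hyperedges):
    """Check that any two intersecting hyperedges share exactly one non-ear node."""
    for a, b in combinations(hyperedges, 2):
        common = a & b
        if len(common) > 1 or any(is_ear(v) for v in common):
            return False
    return True


# Hypothetical conflict tree a - b - c, given by its two edges.
hyperedges = augment_with_ears([("a", "b"), ("b", "c")])
ok = intersections_are_single_inherited_nodes(hyperedges)
```

For a tree, two distinct edges share at most one node, so the check succeeds after the augmentation; it fails for a general hypertree in which two hyperedges share two or more nodes.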

As regards extension i), the minimum and maximum probabilities $p^{\min}$ and $p^{\max}$ of an answer can be computed as 
solutions of the two variants of the optimization problem $LP(t_1 \wedge \cdots \wedge t_n, \mathcal{D}^p, IC, D^p)$ with the following changes: 

1) equalities (C) are replaced with pairs of inequalities imposing that, for each $t_i$, its probability ranges between the 
minimum and maximum marginal probabilities of the range associated with $t_i$ in the PDB; 

2) the values $p^{\min}(t_i \wedge t_j)$ and $p^{\max}(t_i \wedge t_j)$ occurring in the inequalities (B) are evaluated by considering the minimum 
probabilities for the tuples along the path connecting $t_i$ and $t_j$ in the conflict tree. Moreover, when evaluating 
$p^{\min}(t_i \wedge t_j)$, the minimum marginal probabilities for $t_i$ and $t_j$ are taken into account, while, for $p^{\max}(t_i \wedge t_j)$, we have to consider 
their maximum probabilities. Therein, the maximum probability of a tuple $t$ is the minimum between the upper bound 
of the probability range of $t$, and the maximum probability value that $t$ can have according to the conflict tree (this 
value is entailed by the tuples connected to $t$ by direct edges: as implied by Theorem 2, the sum of the probabilities of 
two tuples connected through an edge must be less than or equal to 1). Intuitively, we consider the minimum 
probabilities for the intermediate tuples between $t_i$ and $t_j$ as this allows the greatest degree of freedom in distributing 
$t_i$ and $t_j$ in the probability space. 
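
The maximum probability of a tuple described in item 2) can be sketched in Python as follows. This is one possible reading of the text, not the authors' procedure: it assumes that the value entailed by the conflict tree is obtained by letting each neighbour take the lower bound of its own range, so that the constraint on the sum of the probabilities of adjacent tuples is as loose as possible. The function name and the data are hypothetical.

```python
def effective_upper_bound(t, ranges, adj):
    """Maximum probability that tuple t can take.

    ranges : dict mapping each tuple to its (lower, upper) probability range
    adj    : dict mapping each tuple to the set of its neighbours in the conflict tree
    """
    _, upper = ranges[t]
    # For every neighbour u, p(t) + p(u) <= 1 must hold; assuming u sits at its
    # lower bound, the tree allows p(t) to be at most 1 - lower(u).
    tree_cap = min((1.0 - ranges[u][0] for u in adj[t]), default=1.0)
    return min(upper, tree_cap)


# Hypothetical conflict tree u - t - v with probability ranges.
ranges = {"t": (0.25, 0.75), "u": (0.5, 0.75), "v": (0.25, 0.5)}
adj = {"t": {"u", "v"}, "u": {"t"}, "v": {"t"}}
cap_t = effective_upper_bound("t", ranges, adj)
```

In this example the range of `t` allows up to 0.75, but its neighbour `u` has probability at least 0.5, so the effective maximum for `t` is 0.5.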